Technical Note - Grid Computing: ECS Grid

This document and the underlying grid software changed in April 2012

As of December 2012, machines in the grid started to move to a 64-bit Operating System.

The default version of Java on the School's machines changed again for 2015.

IMPORTANT

Please READ IT ALL AGAIN even if you have flown with us many times before.

If you follow the instructions here and something does not work, please let the programming staff know.

FOR EXISTING/RETURNING USERS

1) There are no longer any NetBSD machines in the ECS/SGE Grid.

As of April 2012, this document no longer makes reference to that platform except in the line above.

2) The default SGE_ARCH string is now lx-amd64, not lx26-x86 or lx-x86.

3) The use of

/vol/grid/sgeusers

should be avoided in favour of

/vol/grid-solar/sgeusers

As of April 2012, this document no longer makes reference to the /vol/grid filesystem, and neither should you!

Summary

There didn't seem to be any ECS-focused documentation aimed at users wishing to run jobs on the ECS Grid, so here are the basics of job submission, the area in which things differ most from the provider's documentation, which used to be at http://dlc.sun.com/pdf/820-0699/820-0699.pdf. Other aspects of job control are covered within that documentation.

Note: After Oracle bought Sun, they stopped providing the documentation at that link, so a local copy (SGE-User-Guide-820-0699.pdf, see the attachment list at the end of this note) is attached.

Details

General

ECS administers two "Grids" known, in administration circles, as the ECS Grid and the SCS Grid. You may also find these referred to as the SGE Grid and the Condor Grid respectively. The two are separate.

The Tech Note for the SCS Grid is at:

http://ecs.victoria.ac.nz/Support/TechNoteScsGrid

The ECS Grid runs under the control of a descendant of the Sun Grid Engine (SGE) and exists to make use of the computing power of the School's ArchLinux machines at times when they are unused, ie when they should have no one logged in at the console, eg overnight.

Jobs, usually shell scripts wrapping a number of tasks, are submitted from any ECS workstation (if your workstation is a Windows machine or Mac, then you will need to login to one of the School's servers) into a simple queuing system where they remain, in turn, until an "unused" machine is able to start them. (Note that being able to start them is not the same as being able to run them to completion.)

At present, users at the console of a machine have priority over Grid jobs running on the same machine, to the extent that a Grid job will be suspended on a machine where there is console activity, so users submitting Grid jobs should be aware that there is no guaranteed run time for any given task. Basically, it'll finish when it finishes.

There is a low volume mailing list which is used to inform users of the ECS grid of matters which may be of interest to them. You can subscribe your School or University email address here:

http://ecs.victoria.ac.nz/mailman/listinfo.cgi/grid-users

Setting up the environment

A single SGE instance can control a number of "Grids". In order to provide the SGE utilities with information about which Grid the user wishes to run their job within, a couple of environmental variables need to be set up. This is achieved using the standard package system's need pkgname environment-modifying process.

We'll be using the SGE Grid (ECS also maintains the SCS Grid, which is not accessible from here), so a simple

need sgegrid

suffices to set up the environment for job submission.

If ever, when trying to run an SGE command, you see this message

critical error: Please set the environment variable SGE_ROOT.

then the chances are that you have forgotten to type, or otherwise arranged to have run, the command:

need sgegrid

Do I have a home on the Grid?

This is slightly quirky and not initially intuitive.

Staff will not be able to access their home directories

Students will

A user of an SGE-controlled Grid might expect to find that their jobs start to execute from a home directory within the overall system, that home directory being accessible to all machines within that system and, for the case of a grid utilising their everyday machines, being their normal working directory after logging in to any of those machines interactively.

A simple qsub submission_script_name would then be enough to start the job off.

With the ECS Grid, however, although all staff and students have a home directory no matter which ECS machine they might login at interactively, this expectation only holds for student accounts.

Because the machines comprising the Grid system can be any of the School workstations, both individuals' office machines and public access lab machines, staff will not have their home directories accessible from a remote machine when running a Grid job and so will have to explicitly set an initial working directory elsewhere.

This can be achieved on the command line at submission time, by use of the -wd path option to qsub, though perhaps a better option for staff is to always place the equivalent SGE directive, pointing at a known path, at the top of the submission script:

#$ -wd path

Of course, non-staff users may also find this mechanism useful.
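For example, using the per-user directory described just below (and the example username fred used throughout this note), a command-line submission with an explicit working directory might look like:

 qsub -wd /vol/grid-solar/sgeusers/fred freds_test.sh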

At the time of this revision of this document, there is an area of accessible filestore that can be used as a standard area from which staff can reliably start SGE jobs.

The /vol/grid-solar/ filesystem is a mount of some filestore supplied by ITS, and users of the ECS grid will find that they have a personal directory within that filesystem, accessible as:

/vol/grid-solar/sgeusers/username

If you don't have a directory there you can ask for one to be created by emailing a request to the ECS request tracking system at jobs@ecs.vuw.ac.nz. The subject of the email should contain ECS Grid and in the text of your request you should tell us your ECS login name, which will be used as the name of the directory.

Note for Windows users

On a Windows system the /vol/grid-solar/ filesystem is accessible via mounting \\smb\grid-solar as a network drive.

Where will the input and output files be?

Because this is a distributed batch processing environment, there's usually no clear indication as to which machine(s) your job(s) will end up running on.

You thus need to give a little more thought to the location of input and output files than if you were simply running a job on your own workstation where everything is local to the machine (though, who knows, your Grid job might end up running on your workstation).

Once the job is running on the remote machine it will have access to many of the NFS-shared filesystems that the user would expect to see from their own workstation during an interactive session.

This can be useful when large data sets need to be accessible for reading and the overhead of copying the data to each machine upon which the job is running is large, because they can be placed at known paths.

The NFS-shared filesystems are less of an advantage for writing out data.

  1. Writing over NFS is often slower than reading
  2. There is the potential for bottlenecks to occur where a user has each job writing to the same directory (or even file, if they get things wrong) over NFS, or where many users each have jobs writing to the same NFS partition.

It is thus advisable to arrange for any output from the program to be written to a directory local to the machine upon which the program is running and then to copy any output to filestore to which the user will have general access, at the end of the job.

The area of filestore provided for this purpose upon every ECS Grid machine is the directory

/local/tmp

Note that this directory may well be in use by the user who normally sits at the console, and will almost certainly be used by Grid jobs that came before your current one and those that come after yours, so there is no guarantee that a path or file name that you wish to create does not already exist.

To avoid any clashes, as a courtesy to other users, and to simplify the process of cleaning up afterwards, a directory below the path /local/tmp will be created that follows the convention

/local/tmp/[username]/$JOB_ID

where $JOB_ID is an environmental variable maintained by the SGE for the duration of the job, and thus available to your submission script and to any programs able to read the environment.

Your submission scripts should ensure you either change to this directory or place any files you require for the job there.
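A minimal sketch of that, using the example username fred as elsewhere in this note (the full template later in this note also checks that the directory exists before changing into it):

 cd /local/tmp/fred/$JOB_ID || exit 1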

Preserving results after execution

Once your job has run and the submission script has terminated, any output written to /local/tmp/[username]/$JOB_ID will be deleted.

In order to get your output back to somewhere more useful to you, there are a number of options:

  1. If you have direct access to your home directory path, you can copy directly to that. Staff can't do this
  2. If you have write access to a shared filesystem you can copy directly to that and then move files into your own filestore from your own machine

Staff will need to exercise Option 2.
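A minimal sketch of Option 2, using the grid-solar area described earlier and the example username fred (the basic template later in this note does much the same at the end of the job); my_output_file is just a placeholder for whatever your job produced:

 mkdir -p /vol/grid-solar/sgeusers/fred/$JOB_ID
 cp my_output_file /vol/grid-solar/sgeusers/fred/$JOB_ID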

Where do stdin, stdout and stderr appear?

When one runs programs locally, program output and error messages will often appear on the console, or in the terminal emulator, and one can usually perform command-line redirection for input.

When you are running a non-interactive job on a remote machine however, it is likely that you aren't going to see any console output during the execution of the program.

The SGE therefore redirects the stdout and stderr channels to file so that they may be inspected after the job has finished.

Typically the default stdout and stderr naming conventions try to create files called scriptname.o$JOB_ID and scriptname.e$JOB_ID respectively, in the working directory of the task when it starts. (See the note about working directories for staff)

The default location of these files can be altered by use of qsub command-line options or via the corresponding SGE directives being specified in the job submission script.
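For example, assuming a logs directory that you have already created under your grid-solar area (the logs/ path here is purely illustrative), directives along these lines in the submission script should send both streams there:

#$ -o /vol/grid-solar/sgeusers/fred/logs/
#$ -e /vol/grid-solar/sgeusers/fred/logs/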

Job submission script example

Basic jobs just need to run on some machine within the Grid. If you know that you want a certain type of processor or need a minimum amount of memory or disk space, then you probably know enough to create the relevant submission script, or at least be capable of reading the Sun documentation for more information.

The job submission script freds_test.sh, used as an example here, is effectively, then, just a simple test of the system. (Though running a simple test to check the basics, eg that directories exist, before you submit 5000 jobs that write to them is a GOOD thing.)

There are, however, a couple of refinements over a simple fire-and-forget activity.

We've taken the view that you'll want to know when your job starts and finishes and so the example job submission script will tell the Grid Engine to email you when it does.

We've taken the view that you'll want to script using the Bourne Shell, so we'll force the SGE to run your job submission script within that shell, because the initial #!/bin/sh line may not be honoured.

We've taken the view that you'll be doing something more than just adding a couple of integers together in a loop (and, no, adding a couple of thousand integers together still doesn't count) so we'll try and access some areas of the filestore you will have access to when you run your jobs and move things around.

Finally, we've written the example for an ECS user with the username fred and the mail address Fred.Bloggs@ecs.vuw.ac.nz, so you might need to change a few things in the script if you are not Fred Bloggs and/or not in ECS.

(HINT: Search for the strings fred, Fred.Bloggs, FRED and ecs).

BIGGER HINT (added after someone tried to mail Fred.Bloggs@ecs.vuw.ac.nz and Fred was not happy getting all the emails):

Search for the strings fred, Fred.Bloggs, FRED and ecs and change to match your username, email address and school

And of course, this is just an example. Once you have modified the script to suit your needs and run a few tests to check things work as you expect, you will probably want to remove some, if not all, of the recording of the environment and so on - but that's up to you.

Sometimes, having the info can be useful in a debugging exercise, eg when you are trying to invoke something not on the PATH because you are effectively logging into a non-interactive environment; sometimes, all the extra clutter makes it hard to see what's happening.

That "extra clutter" should also include the individual job emails you will get, so it is usually worth electing not to be informed when running large numbers of jobs.

IMPORTANT

Some users, mostly users new to the School, have not been in the habit of understanding the template for grid job submission before using it as the basis for their own job submission scripts, and in other cases have been using out-of-date and/or incorrect job submission scripts that they may have been given, or pointed to, by existing users who should really know better.

When you come to modify this template for your own use, then EXCEPT for your username, you should not change ANYTHING in the first 32 lines, ie, above this line:

# Now we are in the job-specific directory so now can do something useful

These lines are our attempt to ensure that the grid is working as we expect it to, and there is no need to alter ANYTHING except your username.

Furthermore, even though they may appear to be, not all of those lines are comments that can simply be removed; an understanding of the process of grid job submission would have made this clear. Once again, you should not change anything in those first 32 lines EXCEPT for your username.

If you have a copy of the template script downloaded and you do a 'diff' of it against the job submission script that you are considering using, the only differences you should see within the first 32 lines should be of the following form

% diff tech_note_template.sh my_job_submission_script.sh
9c9
< #$ -wd /vol/grid-solar/sgeusers/fred 
---
> #$ -wd /vol/grid-solar/sgeusers/myusername 
18,19c18,19
< if [ -d /local/tmp/fred/$JOB_ID ]; then
<         cd /local/tmp/fred/$JOB_ID
---
> if [ -d /local/tmp/myusername/$JOB_ID ]; then
>         cd /local/tmp/myusername/$JOB_ID
26,27c26,27
<         echo "AND LOCAL TMP FRED "
<         ls -la /local/tmp/fred
---
>         echo "AND LOCAL TMP myusername "
>         ls -la /local/tmp/myusername
...

and if your 'diff' suggests otherwise, you should go back and alter your script so that it matches the template apart from the username differences.
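One way to reduce the chance of accidental changes is to let sed make the username substitution for you, as is done for the DRMAA example script later in this note; the file names here are those used in the diff above, and you may still want to adjust the uppercase FRED in the echo lines by hand:

 sed -e "s/fred/myusername/g" tech_note_template.sh > my_job_submission_script.sh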

A basic job submission script

Some people have had problems when they cut and paste the following text on Windows systems.

To convert a file from Windows format to UNIX format you can do the following at the command line

dos2unix < my_windows_file.txt  > my_unix_file.sh

Alternatively, if you use the emacs editor, you can change the character set encoding by typing this useful key sequence

Ctrl-x RET f undecided-unix RET

where RET is the return key.

The example script tries to convert a JPEG file into one with a PNG format. You may wish to generate, or obtain, a small JPEG file for use with the example.

Here is the basic script:

#!/bin/sh
#
# Force Bourne Shell if not Sun Grid Engine default shell (you never know!)
#
#$ -S /bin/sh
#
# I know I have a directory here so I'll use it as my initial working directory
#
#$ -wd /vol/grid-solar/sgeusers/fred 
#
# End of the setup directives
#
# Now let's do something useful, but first change into the job-specific directory that should
#  have been created for us
#
# Check we have somewhere to work now and if we don't, exit nicely.
#
if [ -d /local/tmp/fred/$JOB_ID ]; then
        cd /local/tmp/fred/$JOB_ID
else
        echo "Uh oh ! There's no job directory to change into "
        echo "Something is broken. I should inform the programmers"
        echo "Save some information that may be of use to them"
        echo "Here's LOCAL TMP "
        ls -la /local/tmp
        echo "AND LOCAL TMP FRED "
        ls -la /local/tmp/fred
        echo "Exiting"
        exit 1
fi
#
# Now we are in the job-specific directory so now can do something useful
#
# Stdout from programs and shell echos will go into the file
#    scriptname.o$JOB_ID
#  so we'll put a few things in there to help us see what went on
#
echo ==UNAME==
uname -n
echo ==WHO AM I and GROUPS==
id
groups
echo ==SGE_O_WORKDIR==
echo $SGE_O_WORKDIR
echo ==/LOCAL/TMP==
ls -ltr /local/tmp/
echo ==/VOL/GRID-SOLAR==
ls -l /vol/grid-solar/sgeusers/
#
# OK, where are we starting from and what's the environment we're in
#
echo ==RUN HOME==
pwd
ls
echo ==ENV==
env
echo ==SET==
set
#
echo == WHATS IN LOCAL/TMP ON THE MACHINE WE ARE RUNNING ON ==
ls -ltra /local/tmp | tail
#
echo == WHATS IN LOCAL TMP FRED JOB_ID AT THE START==
ls -la 
#
# Copy the input file to the local directory
#
cp /vol/grid-solar/sgeusers/fred/krb_tkt_flow.JPG .
echo ==WHATS THERE HAVING COPIED STUFF OVER AS INPUT==
ls -la 
# 
# Note that we need the full path to this utility, as it is not on the PATH
#
/usr/pkg/bin/convert krb_tkt_flow.JPG krb_tkt_flow.png
#
echo ==AND NOW, HAVING DONE SOMTHING USEFUL AND CREATED SOME OUTPUT==
ls -la
#
# Now we move the output to a place to pick it up from later
#  (really should check that directory exists too, but this is just a test)
#
mkdir -p /vol/grid-solar/sgeusers/fred/$JOB_ID
cp krb_tkt_flow.png  /vol/grid-solar/sgeusers/fred/$JOB_ID
#
echo "Ran through OK"

As some people on Windows machines have had problems cutting-and-pasting the above content, a downloadable version is available as the attachment submission_script-basic.sh (see the attachment list at the end of this note).

Basic job-related commands

qstat                    shows you the state of your jobs

qstat -u \*              shows you the state of all jobs

qsub script_name         submits the job defined in the script into the queuing system

qdel job_number          deletes the job with the job_number from the queuing system

Note that if you wish to force a job deletion, you will need to run the qdel from greta-pt, the School's general purpose server, not your workstation.
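A forced deletion session might therefore look something like this (12345 being an illustrative job number; the -f option asks qdel for a forced deletion):

 % ssh greta-pt
 % need sgegrid
 % qdel -f 12345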

Emailed output

During the development phase of enabling an existing workflow to be run within the grid, especially when the grid's resources are busy, it can be useful to be notified by email that a job has started, failed or ended.

The qsub man page suggests that you can add various command line options to give you this functionality; however, in common with other command line options, you can place these "directives" inside the job submission script. An example of using submission script directives is:

#
# Mail me at the b(eginning) and e(nd) of the job
#
#$ -M Fred.Bloggs@ecs.vuw.ac.nz
#$ -m be
#

but be aware that trying to email too many messages out from your account may see you exceed overall mail quotas.
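Conversely, as noted earlier, when running large numbers of jobs it is usually worth turning the per-job emails off, which can be done with the n (no mail) setting:

#
# Don't send me any mail about this job
#
#$ -m n
#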

If Fred does choose to notify himself by email, he will see an email message like this when the job starts:

Subject:      Job 341642 (freds_test.sh) Started

Job 341642 (freds_test.sh) Started
 User       = fred
 Queue      = GX755
 Host       = lumiere.ecs.vuw.ac.nz
 Start Time = 03/18/2009 16:20:54

and one like this when it ends:

Subject:      Job 341642 (freds_test.sh) Complete

Job 341642 (freds_test.sh) Complete
 User             = fred
 Queue            = GX755@lumiere.ecs.vuw.ac.nz
 Host             = lumiere.ecs.vuw.ac.nz
 Start Time       = 03/18/2009 16:20:54
 End Time         = 03/18/2009 16:20:55
 User Time        = 00:00:00
 System Time      = 00:00:00
 Wallclock Time   = 00:00:01
 CPU              = NA
 Max vmem         = NA
 Exit Status      = 0

Array Jobs (Task Array Jobs)

The previous example is fine for the submission of one-off jobs, however there may be cases where you wish to submit the same process multiple times, with each invocation performing a different task based on its place within the full set of jobs.

This can be realised by submitting multiple single jobs but another way to do this is to submit an Array Job.

An "array job" is submitted by using the -t F-L:S (First-Last:Step) option, for example

qsub -t 1-10:2 my_submission_script.sh

submits 5 jobs into the grid.

Whilst this form of submission is referred to as a Job Array, the option used, -t, and the environmental variable used to differentiate between the array jobs, $SGE_TASK_ID, suggest that it could be referred to as a Task Array Job.

An example script, similar to the basic submission script example, but showing one way to differentiate the array job output, is available as the attachment submission_script-task_array.sh (see the attachment list at the end of this note).

Managing Array Jobs (Task Array Jobs)

As the single-task submission notes above state, a temporary work area is created for each job, based on the environmental variable $JOB_ID; however, in grid environments where more than one job can run on a single machine, the tasks from a Job Array submission would all have the same $JOB_ID, so a second environmental variable, $SGE_TASK_ID, is provided to differentiate between the tasks.

The temporary directory created for Array Jobs is thus of the form

/local/tmp/[username]/$JOB_ID.$SGE_TASK_ID

so if, for the example given using -t 1-10:2, the JOB_ID was 1234 and the user was fred, these directories would get created (in the case of the ECS Grid, on five different machines)

/local/tmp/fred/1234.1
/local/tmp/fred/1234.3
/local/tmp/fred/1234.5
/local/tmp/fred/1234.7
/local/tmp/fred/1234.9

and the combination of the two environmentals, eg $JOB_ID.$SGE_TASK_ID, can be used as a general construct for differentiating between the individual tasks.
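As an illustration, a task array submission script might use both variables to change into its own work area and then pick up its own slice of the input; the input file naming used here is purely hypothetical:

if [ -d /local/tmp/fred/$JOB_ID.$SGE_TASK_ID ]; then
        cd /local/tmp/fred/$JOB_ID.$SGE_TASK_ID
else
        echo "There's no task directory to change into "
        exit 1
fi
cp /vol/grid-solar/sgeusers/fred/input_$SGE_TASK_ID.dat .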

Specialised job submission

As previously detailed, basic jobs just need to run on some machine within the Grid.

There may, of course, be classes of job where ensuring that all tasks run on the same architecture is desirable, an example within ECS being a desire to ensure run timings were not influenced by differences in the model of machine that individual tasks were executed on.

Similarly, temporary resource partitioning requests, for example ensuring students in a lab tutorial can target the machines in the lab that has been booked for them, require a handle through which the user can access a subset of the full SGE Grid.

A number of the SGE utilities, including qsub, allow for a resource request to be made by use of the option

 -l resource=value

where the resources are maintained within the SGE Complex.

Currently, the local additions (some of which may not always be populated) to the SGE Complex are

ecs_df_local 
ecs_model
ecs_netgroup
ecs_room

so a user wishing to target only those machines which are the model GX745 would need to add

 -l ecs_model=GX745

to their SGE command.
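A complete qsub command using that resource request might therefore look like this (the script name is just illustrative):

 qsub -l ecs_model=GX745 my_submission_script.sh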

Targeting machine architectures

Whilst platform-neutral stuff, eg school-wide packages you run in batch mode, or Java programs, should be affected little, if at all, by differences in the underlying architecture of machines within the grid, it may be useful to specify the architecture you expect your jobs to run on, so as to future-proof your job submission scripts.

To specifically request the OS you want when submitting the job, you can use the -l argument to qsub.

A qsub command targeting the 32-bit ArchLinux machines would be

 qsub -l arch=lx-x86 your_script.sh

A second approach involves checking for the OS your job ends up running on, in the submission script.

The SGE will actually set an environmental variable for you to test against, eg

SGE_ARCH=lx-x86 

You should also have access to the utility that SGE uses to provide its own view of things, on all the machines in the grid, as

 /usr/pkg/sge/util/arch

an invocation of which will return lx-x86

So, for example, if you choose to use the value that SGE would return to differentiate, you might have directories containing OS-specific binaries with the same name:

  /vol/grid-solar/username/mycodes/bin/lx-x86/prog1

or programs, where the name specifies the architecture:

  /vol/grid-solar/username/mycodes/bin/prog1.lx-x86
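A small sketch of the first layout, falling back to the arch utility if $SGE_ARCH happens not to be set (the mycodes paths are just the illustration above):

ARCH=${SGE_ARCH:-`/usr/pkg/sge/util/arch`}
/vol/grid-solar/username/mycodes/bin/$ARCH/prog1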

Here is a template (in Bourne shell syntax) that will allow you to branch your submission script; your commands would go where the "I could run" echo statements appear:

if [ -z "$SGE_ARCH" ]; then
     echo "Can't determine SGE ARCH"
 else
     if [ "$SGE_ARCH" = "lx-x86" ]; then
         echo "I could run a Linux x86 binary"
     fi
 fi

and a similar version in C shell syntax (though it is a good idea to write your job submission scripts in Bourne shell syntax):

if ( $?SGE_ARCH == 0 ) then
    echo "Can't determine SGE ARCH"
else
    if ( $SGE_ARCH == "lx-x86" ) then
        echo "I could run a Linux x86 binary"
    endif
endif

Compilation for the ArchLinux machines

If you are someone without access to the ECS lab machines, you'll not have access to an ArchLinux machine on which to compile code targeting those grid resources.

In this case, however, you should find that binaries compiled for i386 (so not x86_64) on other GNU/Linux machines you have access to may work.

If you experience other problems in this area, please get in touch with us, whilst we await the deployment of a general purpose ArchLinux 32-bit server.

Running Java programs on the ECS/SGE Grid

Running Java programs on the ECS/SGE Grid has always been complicated by the fact that the need javaXYZ command used to set up the full Java environment you would get when running interactively is not available from within the default Grid environment.

It seems worth providing some basic guidelines that should allow many Java programs to operate.

With a bit of guinea-pigging by Roman Klapaukh, it would appear that the following stanza should allow one to submit Java programs to the ECS Grid, without worrying about the OS your job ends up running against.

Note that this solution uses the mechanism outlined above for determining the OS and thus, should you need to branch any other operations on that test, you could combine them.

Note also that you're probably not the user fred so YOU WILL NEED TO EDIT THE SCRIPT

#!/bin/sh
#
# Force Bourne Shell if not Sun Grid Engine default shell (you never know!)
#
#$ -S /bin/sh
#
# I know I have a directory here so I'll use it as my initial working directory
#
#$ -wd /vol/grid-solar/sgeusers/fred 
#
# Now let's do something useful, but first change into the job-specific directory that should
#  have been created for us
#
if [ -d /local/tmp/fred/$JOB_ID ]; then
        cd /local/tmp/fred/$JOB_ID
else
        echo "There's no job directory to change into "
        echo "Something is broken. I should inform the programmers"
        echo "Save some information that may be of use to them"
        echo "Here's LOCAL TMP "
        ls -la /local/tmp
        echo "AND LOCAL TMP FRED "
        ls -la /local/tmp/fred
        echo "Exiting"
        exit 1
fi
#
if [ -z "$SGE_ARCH" ]; then
   echo "Can't determine SGE ARCH"
else
   if [ "$SGE_ARCH" = "lx-amd64" ]; then
       JAVA_HOME="/usr/pkg/java/sun-8"
   fi
fi

if [ -z "$JAVA_HOME" ]; then
   echo "Can't define a JAVA_HOME"
else
   export JAVA_HOME
   PATH="/usr/pkg/java/bin:${JAVA_HOME}/bin:${PATH}"; export PATH

   java Hello
fi

Note that the JAVA_HOME path matches the default that you would get running a bare java command when sitting at an ECS workstation.

The School maintains other Java installations, so if you wish to use those, you will need to edit the script accordingly.

You can't use need inside a basic job submission script

The Java example above highlights a wider issue for writing job submission scripts for the ECS Grid.

Some programs, when run in an interactive shell on ECS/MSOR machines, require the user to modify their shell environment by typing

need [pkgname]

however, the need() function is not available within the non-interactive shells that grid jobs run in. The user therefore needs to be aware of what a

need [pkgname]

is actually doing and, should the program, when run inside a grid job, require the functionality from the "need file", replicate it within the script, as shown in the Java example above.

You can see what an invocation of need [pkgname] does if, from a shell, you look at the file etc/pkgs/[pkgname].sh

It is likely that you won't want to bother replicating everything a "need file" would do (eg, setting the MANPATH for a non-interactive grid job) but you are unlikely to cause any issues if you do replicate everything.

If you do try to use need [pkgname] inside a job submission script, you are likely to see a message akin to this

"ERROR: configuration error -- this is the wrong version of need
        You should never see this -- please report to bugs@ecs.vuw.ac.nz" 

in which case, even if your current job does run, reporting to bugs might help us to give you some extra information.

Using DRMAA with the ECS/SGE Grid

This section provides some information and typical commands required to compile and run codes making use of DRMAA.

Background

Some simple source code examples of using DRMAA via the C and Java bindings have been placed at:

/vol/grid-solar/sgeusers/admin/DRMAA

as an introduction to users wishing to experiment with DRMAA codes.

The C example sources originally come from a Dr Dobbs Journal article (2004, Frederic Pariente)

http://www.ddj.com/184405932

which can seemingly still be found at:

http://www.drdobbs.com/184405932

though an archived version of the original article, less the Flash adverts, is provided locally:

/vol/grid-solar/sgeusers/admin/DRMAA/DrDobbs-2004-Article/

The Java example sources come from the SGE source code distribution, although a small change is required in order to have the codes work as expected.

C Bindings

The C bindings make use of the header file

/usr/pkg/sge/include/drmaa.h

and the default SGE shared library

/usr/pkg/sge/lib/lx-amd64/libdrmaa.so.1.0

Java Bindings

As you will have read above, these are not currently installed.

The Java bindings make use of a locally-compiled JAR-file and dynamic library

/vol/grid-solar/sgeusers/admin/DRMAA/lx-x86/drmaa.jar

/vol/grid-solar/sgeusers/admin/DRMAA/lx-x86/libdrmaa.so

which required a rebuild from the SGE sources.

Simple, proof of concept example: C Binding

Create a directory ~/DRMAA, change into that directory and copy the example codes provided over:

% cp /vol/grid-solar/sgeusers/admin/DRMAA/DrDobbs-Code/* .

Compile and link the proof of concept source

% gcc -c -I/usr/pkg/sge/include/ ListingOne.c
% gcc -o  ListingOne \
      -L/usr/pkg/sge/lib/lx-x86/ \
      -Wl,-R/usr/pkg/sge/lib/lx-x86/ -ldrmaa ListingOne.o

You did remember to

% need sgegrid

Now we can test that things work

% ./ListingOne 
Successfully started the DRMAA library

Spawning an actual job into the SGE: C Binding

The file drdobbs-shell.c is a slightly modified version of ListingTwo.c, which allows one to specify the script to be executed as a command line argument and sets an SGE-native option required to tell SGE to "do the right thing".

Compile the source

% gcc -c -I/usr/pkg/sge/include/ drdobbs-shell.c
% gcc -o drdobbs-shell \
      -L/usr/pkg/sge/lib/lx-amd64/ \
      -Wl,-R/usr/pkg/sge/lib/lx-amd64/ -ldrmaa \
 drdobbs-shell.o

Edit, or otherwise replace, the placeholder username "fred" used within the job submission script to match your username

% mv i_am_alive.sh i_am_alive.sh.orig
% sed -e "s/fred/yourusername/g" i_am_alive.sh.orig > i_am_alive.sh
% chmod u+x i_am_alive.sh

As written the DRMAA code that will spawn the job will not pay attention to the directory you are in when you use the DRMAA executable to spawn your job script into the SGE, so we run as follows:

% ~/DRMAA/drdobbs-shell ~/DRMAA/i_am_alive.sh
Your job "/u/students/fred/DRMAA/i_am_alive.sh"has been submitted with id 000000
%

after which you should find the log files from the running of your script in your home directory

% ls -ltr ~
...
drwx------  2 fred  students    512 Oct  6 12:02 DRMAA
-rw-r--r--  1 fred  students      0 Oct  6 12:20 i_am_alive.sh.e000000
-rw-r--r--  1 fred  students  29753 Oct  6 12:20 i_am_alive.sh.o000000
%

Note that, as written, the directive at the top of the job submission script which requests an initial working directory

#$ -wd /vol/grid-solar/sgeusers/fred

has been ignored and inspection of the output file confirms this:

% cat ~/i_am_alive.sh.o000000
==UNAME==
breaker.msor.vuw.ac.nz
==WHO AM I and GROUPS==
uid=0000(fred) gid=25(students) groups=25(students),1500(c302t1)
students c302t1
==SGE_O_WORKDIR==
/home/rialto1/fred/DRMAA
...

Spawning an actual job into the SGE: Java Binding

As you will have read above, the supporting files for the Java Binding are not currently installed.

The modified version of the SGE-provided Howto2.java adds the setting of an SGE-native option required to tell SGE to "do the right thing" and removes the package qualification from the code.

Create a directory DRMAA, change into that directory and copy the example codes provided over,

cp /vol/grid-solar/sgeusers/admin/DRMAA/SGE-Code/* .

Make sure you have the "Native" version of Java

% need java2-native

Compile the Java source against the locally-built DRMAA JAR-file

% javac -cp /vol/grid-solar/sgeusers/admin/DRMAA/lx-amd64/drmaa.jar:. Howto2.java

As written, the DRMAA code will look for the hard-coded script it is going to launch (sleeper.sh) as the SGE job in your home directory, so we copy it there and ensure it is executable before running the DRMAA code.

You did remember to

% need sgegrid

as well though?

Notice that we need to tell Java to use the locally built dynamic library as well, by defining the search path within the Java environment

% cp sleeper.sh ~
% chmod u+x ~/sleeper.sh 
% java -Djava.library.path=/vol/grid-solar/sgeusers/admin/DRMAA/lx-amd64 \
   -cp /vol/grid-solar/sgeusers/admin/DRMAA/lx-amd64/drmaa.jar:. Howto2
Your job has been submitted with id 000000
%

after which you should find the log files from the running of your script in your home directory

% ls -ltr ~
...
-rwx------  1 fred    1746 Sep 30 14:50 sleeper.sh
drwx------  2 fred     512 Sep 30 15:25 DRMAA
-rw-r--r--  1 fred       0 Sep 30 15:28 Sleeper.e000000
-rw-r--r--  1 fred      99 Sep 30 15:28 Sleeper.o000000
% cat ~/Sleeper.o
Here I am. Sleeping now at: Tue Sep 30 15:28:06 NZDT 2014
Now it is: Tue Sep 30 15:28:11 NZDT 2014
%

Note that the job submission has taken notice of the script directive requesting that our job have the name "Sleeper"

#$ -N Sleeper

and that this is reflected in the names of the logfiles.

Caveats

Environmental variables now have an SGE_ prefix not GE_

As noticed by many but formally pointed out by Kourosh Neshatian (and, belatedly, Lloyd Parkes):

The Sun documentation and man pages for the Sun Grid Engine (SGE) mention environmental variables of the form GE_SOME_THING.

The docs are out of date with respect to current SGE implementations, and users should be using environmental variables of the form SGE_SOME_THING.

Jobs in Error states: unable to chdir

We have seen occurrences of jobs being unable to start, and thus entering an Error state (ie, showing as Eqw when the user does a qstat).

If, in response to a

qstat -explain E -j 12345

where 12345 is the job number, you are told that the error reason was an inability to change to a directory that you know to be there, then you may simply have been a victim of network congestion on the fileserver at the time the job tried to start.

In this case, you should be able to clear the error condition by carrying out these steps

  • ssh greta-pt
  • need sgegrid
  • qmod --clearjob 12345

If the job doesn't start then you should send an email to jobs@ecs.vuw.ac.nz giving us as much detail as possible.
Attachments:

  SGE-User-Guide-820-0699.pdf (2 MB, 16 Sep 2013) - Sun N1 Grid Engine 6.1 User's Guide
  submission_script-basic.sh (2 K, 29 Aug 2016) - The basic submission script
  submission_script-task_array.sh (2 K, 29 Aug 2016) - The task array submission script