Rules for running "big jobs" on pangea
Last revision July 19, 2004
Pangea is the main server and timesharing computer for the School of
Earth Sciences.
It is a powerful computer, but subject to heavy use from a large user
community of about 500 people.
During peak daytime hours, typically more than 100 users are logged in
for timesharing use.
At the same time, there may be demands for file or print service from
workstations and personal computers.
During off-hours, several people may be trying to run big jobs at once
to do scientific analyses.
In order to fairly distribute the computing capacity among all users,
the following policy has been created to control "big jobs".
Normal types of interactive use, such as reading email or editing files,
do not count as big jobs.
However, scientific programs, some large compilations, find commands
that search a large portion of the disks, and similar programs are
usually big jobs.
For the purposes of this policy, a big job is defined as a process
(or pipeline of processes) that:
(1) will use more than 10 CPU minutes;
or
(2) will use more than 20,000 Kilobytes of real physical memory at one time;
or
(3) will use more than 250,000 Kilobytes of virtual memory (swap space).
This definition and policy apply equally to processes running in
the foreground (in control of the terminal) or background (detached
from the terminal).
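
The three thresholds above can be expressed as a small check. This is only a sketch in POSIX shell: the function name is_big_job and its argument order are hypothetical, and you would have to supply the CPU minutes and memory figures yourself (for example, from ps output).

```shell
#!/bin/sh
# Thresholds copied from the big job definition above.
BIG_CPU_MIN=10       # CPU minutes
BIG_RSS_KB=20000     # real physical memory, in Kilobytes
BIG_VSZ_KB=250000    # virtual memory (swap space), in Kilobytes

# is_big_job CPU_MINUTES RSS_KB VSZ_KB
# Hypothetical helper: prints "big" if any one threshold is exceeded,
# otherwise prints "small".
is_big_job() {
    if [ "$1" -gt "$BIG_CPU_MIN" ] ||
       [ "$2" -gt "$BIG_RSS_KB" ] ||
       [ "$3" -gt "$BIG_VSZ_KB" ]; then
        echo big
    else
        echo small
    fi
}

is_big_job 3 15000 100000     # within all three limits: prints "small"
is_big_job 12 15000 100000    # more than 10 CPU minutes: prints "big"
```

Note that exceeding any single limit is enough to make a process a big job.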
If you violate this policy and your big jobs put a significant
drag on pangea, the system manager will either kill them
outright or lower their priority to the lowest setting.
Big job rules for pangea:
- Never run more than one big job at a time. This is the
most important rule. It applies to both foreground and background jobs. For example,
if you have a big job in the background, you can only run smaller processes in
the foreground.
In fact, it is often counterproductive to run several big jobs at once. These
jobs will compete with each other. If this causes the system to waste CPU power
constantly switching back and forth, then it will take longer to complete them
all than if they had run sequentially. You can easily make a small
shell script that will run programs sequentially; then you can simply start
that script instead of starting multiple jobs at once.
- Big jobs must be run at lowered priority, using the
nice command. Priority levels range from 0 (highest) to 20 (lowest). A "normal"
big job should run at priority 10. If you know from previous experience that your
program is really big, for example, it takes 500 CPU minutes to complete,
then it should run at priority 19. This way, it gets CPU time when the system
is idle, but does not interfere with interactive users. For example, to start
the program gepinv with priority 10, use the command: nice +10 gepinv
- Big jobs that use a great deal of real physical memory should not
be run during daytime hours at all. Even though they may have their priority
lowered, so they don't hog the CPU time, they are still hogging memory. During
the daytime, when we have 100 users logged in doing mail, editing files, etc.,
memory can be a bigger constraint than CPU time. If a few jobs hog the memory,
everybody else experiences a slowdown in response because their processes have
to be constantly swapped in and out of memory. Jobs that use more than 200,000
Kilobytes of real memory at one time should not be run during the daytime.
- Big jobs should not be run with the program name a.out.
This is the default name given by the
compiler and on pangea this is interpreted to mean "this is a test job". Test
jobs should not be allowed to run very long. In fact, on pangea, processes named
a.out will be automatically killed after they have used 5 minutes
of CPU time. Production programs should be given real names, like ressim
for a reservoir simulator, that give some idea of what they do. This way,
the system manager can at least make some guess whether the job should
be taking so long. It is very common for users to make programming mistakes that
cause their programs to go into infinite loops, consuming resources forever. Using
the name a.out is a signal that you are in a test mode where that
can happen; therefore if it does start consuming much CPU time or memory, this
is a signal to the manager to kill it. Using a real name is a signal to the manager
that this is a production program that perhaps will use a lot of CPU time for
productive purposes, and the manager should contact you if it seems to be using
too many resources, rather than just kill it.
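
The first two rules (one big job at a time, at lowered priority) can be combined in a small script. The following is a minimal sketch in POSIX shell, not an official pangea tool: the function name run_sequentially and the NICE_LEVEL variable are hypothetical. It uses the standard "nice -n" form; the "nice +10" form shown above is the one built into csh.

```shell
#!/bin/sh
# Priority for the jobs; 10 is the level suggested above for a "normal"
# big job. Set NICE_LEVEL=19 for a really big one.
NICE_LEVEL=${NICE_LEVEL:-10}

# run_sequentially "command line 1" "command line 2" ...
# Hypothetical helper: runs each command line only after the previous one
# has finished, so no two big jobs compete with each other.
run_sequentially() {
    for job in "$@"; do
        echo "starting: $job"
        # A failed job is reported, but the remaining jobs still run.
        nice -n "$NICE_LEVEL" sh -c "$job" || echo "failed: $job" >&2
    done
}
```

For example, with made-up input and output file names, the call run_sequentially "gepinv < run1.in > run1.out" "ressim < case2.in > case2.out" starts ressim only after gepinv has finished. Both programs have real names rather than a.out.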
Finally, users must periodically check test programs or long-running
jobs. Unless you know that your program is free of bugs and can run
unattended for a long time, do not just start it and leave. Periodically check
the output files from your processes to make sure that they are doing something
reasonable and not just consuming CPU time because of a bug. Users have a responsibility
to "watch" their programs and take action to kill them if they appear to have
bugs that are wasting resources without accomplishing any real work. Learn how
to use the ps and kill commands
to watch your big background jobs and kill them, if necessary.
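
As a starting point, the checks described above might look like the following sketch in POSIX shell. The function names show_my_jobs and stop_job and the GRACE variable are hypothetical, and the ps options shown are the standard POSIX ones, which may differ from what pangea's own ps accepts.

```shell
#!/bin/sh
# Seconds to wait after a polite termination request (hypothetical default).
GRACE=${GRACE:-5}

# show_my_jobs: list your own processes with their priority (nice value),
# accumulated CPU time, virtual memory size in Kilobytes, and program name.
show_my_jobs() {
    ps -u "$(id -un)" -o pid,nice,time,vsz,comm
}

# stop_job PID: ask the process to terminate; if it is still present after
# GRACE seconds, kill it unconditionally.
stop_job() {
    pid=$1
    kill -TERM "$pid" 2>/dev/null || return 0    # already gone
    sleep "$GRACE"
    if kill -0 "$pid" 2>/dev/null; then          # signal 0 only tests existence
        kill -KILL "$pid" 2>/dev/null
    fi
    return 0
}
```

Run show_my_jobs from time to time while a long job is active. If the CPU time keeps growing but the output files are not changing, pass the process ID from the first column to stop_job.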