Stanford University School of Earth Sciences
 

Rules for running "big jobs" on pangea

Last revision July 19, 2004

Pangea is the main server and timesharing computer for the School of Earth Sciences. It is a powerful computer, but subject to heavy use from a large user community of about 500 people. During peak daytime hours, typically more than 100 users are logged in for timesharing use. At the same time, there may be demands for file or print service from workstations and personal computers. During off-hours, several people may be trying to run big jobs at once to do scientific analyses.

In order to fairly distribute the computing capacity among all users, the following policy has been created to control "big jobs". Normal types of interactive use, such as reading email or editing files, do not count as big jobs. However, scientific programs, some large compilations, find commands that are searching a large portion of the disks, and similar programs are usually big jobs.

For the purposes of this policy, a big job is defined as a process (or pipeline of processes) that: (1) will use more than 10 CPU minutes; or (2) will use more than 20,000 Kilobytes of real physical memory at one time; or (3) will use more than 250,000 Kilobytes of virtual memory (swap space). This definition and policy apply equally to processes running in the foreground (in control of the terminal) and in the background (detached from the terminal).
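The three limits above can be written down as a simple test. The following is only a sketch in POSIX shell: the function name is_big_job and its argument order are made up for illustration, and you must supply the CPU minutes and memory figures yourself (for example, from what you know about earlier runs of your program). Only the three threshold values come from this policy.

```shell
#!/bin/sh
# Thresholds copied from the big job definition above.
MAX_CPU_MIN=10       # CPU minutes
MAX_RSS_KB=20000     # real physical memory, in Kilobytes
MAX_VSZ_KB=250000    # virtual memory (swap space), in Kilobytes

# is_big_job CPU_MINUTES RSS_KB VSZ_KB
# Hypothetical helper: succeeds (exit status 0) if ANY one limit is
# exceeded, because the definition joins the three conditions with "or".
# Arguments must be whole numbers.
is_big_job() {
    [ "$1" -gt "$MAX_CPU_MIN" ] ||
    [ "$2" -gt "$MAX_RSS_KB" ] ||
    [ "$3" -gt "$MAX_VSZ_KB" ]
}

# Example: 3 CPU minutes, 45,000 KB of real memory, 60,000 KB virtual.
# The CPU time is small, but the real memory limit is exceeded.
if is_big_job 3 45000 60000; then
    echo "big job"
else
    echo "not a big job"
fi
```

Note that a job exactly at a limit (10 CPU minutes, say) does not count as big, since the definition says "more than".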

If you violate this policy, and your big job or jobs cause a significant drag on pangea, then the system manager will either kill the offending job outright or lower its priority to the lowest setting.

Big job rules for pangea:

  1. Never run more than one big job at a time. This is the most important rule. It applies to both foreground and background jobs. For example, if you have a big job in the background, you can only run smaller processes in the foreground.

    In fact, it is often counterproductive to run several big jobs at once. These jobs will compete with each other. If this causes the system to waste CPU power constantly switching back and forth, then it will take longer to complete them all than if they had run sequentially. You can easily make a small shell script that will run programs sequentially; then you can simply start that script instead of starting multiple jobs at once.

  2. Big jobs must be run at lowered priority, using the nice command. Priority levels range from 0 (highest) to 20 (lowest). A "normal" big job should run at priority 10. If you know from previous experience that your program is really big (for example, it takes 500 CPU minutes to complete), then it should run at priority 19. This way, it gets CPU time when the system is idle, but does not interfere with interactive users. For example, to start the program gepinv with priority 10, use the command: nice +10 gepinv

  3. Big jobs that use a great deal of real physical memory should not be run during daytime hours at all. Even though they may have their priority lowered, so they don't hog the CPU time, they are still hogging memory. During the daytime, when we have 100 users logged in doing mail, editing files, etc., memory can be a bigger constraint than CPU time. If a few jobs hog the memory, everybody else experiences a slowdown in response because their processes have to be constantly swapped in and out of memory. Jobs that use more than 200,000 Kilobytes of real memory at one time should not be run during the daytime.

  4. Big jobs should not be run with the program name a.out. This is the default name given by the compiler, and on pangea it is interpreted to mean "this is a test job". Test jobs should not be allowed to run very long. In fact, on pangea, processes named a.out will be automatically killed after they have used 5 minutes of CPU time. Production programs should be given real names, like ressim for a reservoir simulator, that give some idea of what they do. This way, the system manager can at least make some guess whether the job should be taking so long. It is very common for users to make programming mistakes that cause their programs to go into infinite loops, consuming resources forever. Using the name a.out is a signal that you are in a test mode where that can happen; therefore, if the program starts consuming a lot of CPU time or memory, the manager will take that as a signal to kill it. Using a real name is a signal to the manager that this is a production program that perhaps will use a lot of CPU time for productive purposes, and that the manager should contact you if it seems to be using too many resources, rather than just kill it.

  5. Finally, users must periodically check test programs or long running jobs. Unless you know that your program is free of bugs and can run unattended for a long time, do not just start it and leave. Periodically check the output files from your processes to make sure that they are doing something reasonable and not just consuming CPU time because of a bug. Users have a responsibility to "watch" their programs and take action to kill them if they appear to have bugs that are wasting resources without accomplishing any real work. Learn how to use the ps and kill commands to watch your big background jobs and kill them, if necessary.
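Rules 1 and 2 can be combined in the small shell script mentioned under rule 1. The following is a minimal sketch, not an official pangea script: the function name run_queue is made up for illustration, and the echo commands are stand-ins for real program names such as gepinv or ressim. It is written for the Bourne shell (sh), where the priority is given as nice -n 10; the nice +10 form shown in rule 2 is, as far as we can tell, the C shell's built-in spelling of the same thing.

```shell
#!/bin/sh
# run_queue LEVEL COMMAND...
# Hypothetical helper: runs each command in turn at the given nice
# level. The next command starts only after the previous one has
# exited, so there is never more than one big job running at a time.
# Each COMMAND is a single string; it is split on spaces, so this
# sketch does not handle arguments that themselves contain spaces.
run_queue() {
    level=$1
    shift
    for prog in "$@"; do
        echo "starting $prog at nice level $level"
        nice -n "$level" $prog || echo "$prog exited with status $?"
    done
}

# Stand-in commands for demonstration. A real queue might instead be:
#   run_queue 10 ./gepinv ./ressim
run_queue 10 "echo first job" "echo second job"
```

If the script is saved under a descriptive name and started in the background, it counts as your one big job. To watch it as rule 5 requires, a command along the lines of ps -u yourname -o pid,nice,time,vsz,comm lists your processes with their priority and accumulated CPU time (the exact ps options vary between Unix systems, so check the ps manual page on pangea), and kill followed by the process ID stops a job that has gone wrong.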

 

