HTCondor Useful Commands
https://research.iac.es/sieinvens/siepedia/pmwiki.php?n=HOWTOs.CondorUsefulCommands
HTCondor has several dozen commands, but this section presents only the most common ones (for the complete list, see the Command Reference page). You can also get further information by running man condor_<cmd> in your shell or by visiting the official Users' Manual.
Checking Pool Status
condor_status
List the slots in the HTCondor pool and their state: Owner (used by owner), Claimed (used by HTCondor), Unclaimed (available to be used by HTCondor), etc.
Useful options:
-avail: List the slots that are not busy and could run HTCondor jobs at this moment
-submitters: Show information about the current general status, such as the number of running, idle and held jobs (and submitters)
-run: List the slots that are currently running jobs and show related information (owner of each job, machine it was submitted from, etc.)
-compact: Compact list, with one line per machine instead of one per slot
-state -total: List a summary according to the state of each slot
-master: List machines, but just their names (status and slots are not shown)
-server: List attributes of slots, such as memory, disk, load, flops, etc.
-af <attr1> <attr2> <...>: List specific attributes of slots, using autoformat (new version, very powerful)
-format <fmt> <attr>: List attributes using the specified format (old version). For instance, the following command shows the name of each slot and its disk space: condor_status -format "%s\t " Name -format "%d KB\n" Disk
-constraint <constraint>: Only show slots that satisfy the constraint. For example, condor_status -constraint 'Memory > 1536' will only show slots with more than 1.5 GB of RAM per slot.
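As a minimal sketch, the -constraint and -af options can be combined in a small shell helper. The function name big_free_slots and its default threshold are illustrative assumptions, not part of HTCondor; State, Memory and Name are standard slot attributes.

```shell
# Hypothetical helper: list the name and memory (in MB) of unclaimed slots
# that have more than a given amount of memory.
big_free_slots() {
    min_mb="${1:-1536}"   # threshold in MB; the default mirrors the example above
    condor_status -constraint "State == \"Unclaimed\" && Memory > ${min_mb}" -af Name Memory
}

# Usage (on a machine where HTCondor is installed):
#   big_free_slots 2048
```

The constraint is a ClassAd expression, so several conditions can be joined with && as done here.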
Submitting jobs
condor_submit <submit_file>
Submit jobs to the HTCondor queue according to the information specified in submit_file. Visit the submit file page to see some examples of these files. There are also some FAQs related to the submit file.
Useful options:
-dry-run <dest_file>: Parse the submit file and save all the related info (names and locations of input and output files after expanding all variables, value of requirements, etc.) to <dest_file>, without submitting any jobs. Using this option is highly recommended when debugging, or before the actual submission if you have modified your submit file and are not sure whether the changes will work.
-append <command>: Add submit commands at submission time, without changing the submit file. You can add more than one command by repeating -append.
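The two options can be wrapped in small shell helpers, as a rough sketch. The function names check_submit and submit_with, the default file name dry_run.txt and the sample command request_memory = 2048 are assumptions made for illustration only.

```shell
# Hypothetical helper: parse a submit file without submitting anything and
# save the expanded job information to a destination file for inspection.
check_submit() {
    submit_file="$1"
    dest_file="${2:-dry_run.txt}"   # illustrative default name
    condor_submit -dry-run "$dest_file" "$submit_file"
}

# Hypothetical helper: submit with one extra submit command added at
# submission time, leaving the submit file itself unchanged.
submit_with() {
    submit_file="$1"
    extra_command="$2"              # e.g. 'request_memory = 2048'
    condor_submit -append "$extra_command" "$submit_file"
}

# Usage (on a machine where HTCondor is installed):
#   check_submit job.submit
#   submit_with job.submit 'request_memory = 2048'
```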
When submitted, each job is identified by a pair of numbers X.Y, like 345.32. The first number (X) is the cluster id: every submission gets a different cluster id, which is shared by all jobs belonging to the same submission. The second number (Y) is the process id: if you submitted N jobs, this id goes from 0 for the first job to N-1 for the last one. For instance, if you submit a file specifying 4 jobs and HTCondor assigns id 523 to that cluster, then the ids of your jobs will be 523.0, 523.1, 523.2 and 523.3.
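The numbering scheme can be illustrated with a few lines of plain shell that do not call HTCondor at all. The function names job_ids, cluster_of and process_of are hypothetical helpers written for this example.

```shell
# Print the job ids (cluster.process) of a submission with N jobs.
job_ids() {
    cluster="$1"
    count="$2"
    proc=0
    while [ "$proc" -lt "$count" ]; do
        echo "${cluster}.${proc}"
        proc=$((proc + 1))
    done
}

# Split a job id into its two parts with POSIX parameter expansion.
cluster_of() { echo "${1%%.*}"; }   # text before the first dot
process_of() { echo "${1#*.}"; }    # text after the first dot

job_ids 523 4
```

Run as written, this prints 523.0, 523.1, 523.2 and 523.3, one per line. With the same helpers, cluster_of 345.32 gives 345 and process_of 345.32 gives 32.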
Before submitting your jobs, it is recommended to run some simple tests to make sure that both your submit file and your program work properly: if you are going to submit many jobs and each job takes several hours to finish, first try with just a few jobs and change the input data so that they finish in minutes. Then check the results to see if everything went fine before submitting the real jobs. Bear in mind that submitting untested files and/or jobs may waste time and resources if they fail, and your priority will also be lower in subsequent submissions.
Checking and Managing Submitted Jobs
condor_q
Show my jobs that have been submitted from this machine. By default you will see the ID of the job (clusterID.processID), the owner, submission time, run time, status, priority, size and command. [STATUS: I: idle (waiting for a machine to execute on); R: running; H: on hold (there was an error, waiting for user's action); S: suspended; C: completed; X: removed; <: transferring input; and >: transferring output]
Useful options:
-global: Show my jobs submitted from any machine, not only the current one
-wide: Do not truncate long lines. You can also use -wide:<n> to truncate lines to fit n columns
-analyze <job_id>: Analyze a specific job and show the reason why it is in its current state (useful for jobs in the Idle state: HTCondor shows how many slots match our restrictions and may give suggestions)
-better-analyze <job_id>: Analyze a specific job and show the reason why it is in its current state, giving extended info
-long <job_id>: Show all information related to that job
-run: Show your running jobs and related info, such as how long they have been running, on which machine, etc.
-currentrun: Show the time consumed by the current run; the cumulative time from previous executions is not counted (you can also combine it with the -run flag to see only the processes running at the moment)
-hold: Show only jobs in the "on hold" state and the reason for it. Held jobs are those that hit an error and could not finish. The user is expected to solve the problem and then use the condor_release command so that the job is tried again
-af <attr1> <attr2> <...>: List specific attributes of jobs, using autoformat
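A short sketch of how these options might be combined when diagnosing jobs. The function names held_reasons and why_idle are hypothetical; the -constraint option of condor_q is not described in the list above but is used here together with -af, relying on JobStatus == 5 being the numeric code for the held state.

```shell
# Hypothetical helper: print cluster id, process id and hold reason
# (space separated) for each of my held jobs.
held_reasons() {
    condor_q -constraint 'JobStatus == 5' -af ClusterId ProcId HoldReason
}

# Hypothetical helper: ask HTCondor why a given job is not running yet.
why_idle() {
    job_id="$1"                     # e.g. 345.32
    condor_q -better-analyze "$job_id"
}

# Usage (on the machine the jobs were submitted from):
#   held_reasons
#   why_idle 345.32
```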
condor_tail <job_id>
Display on screen the last lines of the stdout (screen) of a running job on a remote machine. You can use this command to check whether your job is working fine; you can also view errors (stderr) or output files created by your program (see also this FAQ).
Useful options:
-f: Do not stop displaying the content; it will be displayed until interrupted with Ctrl+C
-no-stdout -stderr: Show the content of stderr instead of stdout
-no-stdout <output_file>: Show the content of an output file (output_file has to be listed in the transfer_output_files command in the submit file).
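The options above can be bundled into one helper that follows either stream of a running job. This is only a sketch; the function name follow_job and its second argument are assumptions introduced for this example.

```shell
# Hypothetical helper: keep displaying stdout (default) or stderr of a
# running job until interrupted with Ctrl+C.
follow_job() {
    job_id="$1"                     # e.g. 345.32
    stream="${2:-stdout}"
    case "$stream" in
        stdout) condor_tail -f "$job_id" ;;
        stderr) condor_tail -f -no-stdout -stderr "$job_id" ;;
        *) echo "usage: follow_job <job_id> [stdout|stderr]" >&2; return 2 ;;
    esac
}

# Usage (on the machine the job was submitted from):
#   follow_job 345.32
#   follow_job 345.32 stderr
```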
condor_release <job_id>
Release a specific held job in the queue.
Useful options:
<cluster_id>: Instead of giving a <job_id>, you can specify just the <cluster_id> in order to release all held jobs of a specific submission
-constraint <constraint>: Release all my held jobs that satisfy the constraint
-all: Release all my held jobs
Jobs in the on hold state are those that HTCondor was not able to execute properly, usually due to problems with the executable, paths, etc. If you can solve the problems by changing the input files and/or the executable, then you can use the condor_release command to run your program again, since it will send all files to the remote machines again. If you need to change the submit file to solve the problems, then condor_release will NOT work, because it does not evaluate the submit file again. In that case you can use condor_qedit (see this FAQ) or cancel all held jobs and resubmit them.
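As a hedged sketch of the condor_qedit route, the helper below changes one attribute of a held job and releases the job only if the edit succeeded. The function name edit_and_release is hypothetical, and RequestMemory with the value 2048 is just an illustrative choice of attribute to fix.

```shell
# Hypothetical helper: edit one attribute of a held job in the queue and
# release the job only if the edit succeeded.
edit_and_release() {
    job_id="$1"                     # e.g. 345.32
    attr="$2"                       # e.g. RequestMemory
    value="$3"                      # e.g. 2048
    condor_qedit "$job_id" "$attr" "$value" && condor_release "$job_id"
}

# Usage (on the machine the job was submitted from):
#   edit_and_release 345.32 RequestMemory 2048
```

Chaining the two commands with && avoids releasing a job whose problem has not actually been fixed.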
condor_hold <job_id>
Put jobs into the hold state. This can be useful when you detect problems with your input data (see this FAQ for more info), when you are running out of disk space for outputs, etc. With this command you can delay the execution of your jobs by holding them and, after solving the problems, return them to the idle status using condor_release, so they will be executed again.
Useful options:
<cluster_id>: Instead of giving a <job_id>, you can specify just the <cluster_id> in order to hold all jobs of a specific submission
-constraint <constraint>: Hold all jobs that satisfy the constraint
-all: Hold all my jobs from the queue
condor_rm <job_id>
Remove a specific job from the queue (it will be removed even if it is running). Jobs are only removed from the current machine, so if you submitted jobs from different machines, you need to remove your jobs from each of them.
Useful options:
<cluster_id>: Instead of giving a <job_id>, you can specify just the <cluster_id> in order to remove all jobs of a specific submission
-constraint <constraint>: Remove all jobs that satisfy the constraint
-all: Remove all my jobs from the queue
-forcex <job_id>: After removing jobs, they may not disappear from the queue as expected but just change their status to X. That is normal, since HTCondor may need to do some extra operations. If jobs stay in the X status for a very long time, you can force their removal by adding the -forcex option. For instance: condor_rm -forcex -all.
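A small sketch of selective removal with -constraint, assuming you only want to drop the jobs of one submission that have not started yet. The function names are hypothetical; JobStatus == 1 is the numeric code for the idle state, and ClusterId is the standard job attribute holding the cluster id.

```shell
# Hypothetical helper: remove only the idle jobs of a given cluster,
# leaving the jobs that are already running untouched.
remove_idle_of_cluster() {
    cluster="$1"                    # e.g. 523
    condor_rm -constraint "ClusterId == ${cluster} && JobStatus == 1"
}

# Hypothetical helper: force the removal of a job stuck in the X status.
force_remove() {
    job_id="$1"                     # e.g. 523.2
    condor_rm -forcex "$job_id"
}

# Usage (on the machine the jobs were submitted from):
#   remove_idle_of_cluster 523
#   force_remove 523.2
```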
Getting Info from Logs
condor_userlog <file.log>
Show and summarize job statistics from the job log files (those created when using the log command in the submit file).
condor_history
Show all completed jobs to date (it has to be run on the same machine from which the jobs were submitted).
Useful options:
-userlog <file.log>: List basic information registered in the log files (use condor_logview <file.log> to see the information in graphic mode)
-long XXX.YYY -af LastRemoteHost: Show the machine where job XXX.YYY was executed
-constraint <constraint>: Only show jobs that satisfy the constraint. For example, condor_history -constraint 'RemoveReason=!=UNDEFINED' shows your jobs that were removed before completion
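The history queries above can be sketched as two shell helpers. The function names where_ran and removed_jobs are hypothetical; the attributes LastRemoteHost, RemoveReason, ClusterId and ProcId are standard job attributes. The job id is passed before -af because everything after -af is read as an attribute name.

```shell
# Hypothetical helper: print the machine where a finished job was executed.
where_ran() {
    job_id="$1"                     # e.g. 345.32
    condor_history "$job_id" -af LastRemoteHost
}

# Hypothetical helper: list the jobs that were removed before completion,
# together with the recorded reason.
removed_jobs() {
    condor_history -constraint 'RemoveReason =!= UNDEFINED' -af ClusterId ProcId RemoveReason
}

# Usage (on the machine the jobs were submitted from):
#   where_ran 345.32
#   removed_jobs
```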
There is also an online tool to analyze your log files and get more information: HTCondor Log Analyzer (http://condorlog.cse.nd.edu/).
Other Commands
condor_userprio
Show active HTCondor users' priority. Lower values mean higher priority, with 0.5 being the highest. Use condor_userprio -allusers to see the priority of all users; you can also add the flags -priority and/or -usage to get detailed information.
condor_submit_dag <dag_file>
Submit a DAG file, used to describe jobs with dependencies. Visit the Submit File (HowTo) section for more info and examples.
condor_version
Print the version of HTCondor.