Diagnosing Errors
This page lists potential issues that can arise while running the pipeline. These are typically runtime errors caused either by the cluster (e.g. a compute node failing while your job is running) or by job parameterisation (e.g. underestimating the RAM required).
This is not an exhaustive list, and is intended to give the user a sense of which problems may need to be reported to support@ilifu.ac.za and which problems can be ignored.
Unable to launch jobs in SLURM
If the status of your job is (launch failed requeued held), please file a ticket with support@ilifu.ac.za.
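To check the state and the reason SLURM reports for a held job before filing the ticket, the standard SLURM tools can be used (a minimal sketch; replace 1234567 with your job ID):
squeue -u $USER                                # list your jobs and their current state
scontrol show job 1234567 | grep -i reason     # show the reason string for a specific job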
Node Failure
You may encounter the rare error of a node failure, which shows an error message similar to the following within your logs:
*** JOB 14024 ON compute-010 CANCELLED AT 2020-08-10T03:55:21 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
If you encounter this error, please file a ticket with support@ilifu.ac.za.
Memory error
If you see the phrase oom (Out of Memory) or MemoryError in the .err logs (these can be located with grep -i oom logs/*.err), this typically indicates that CASA did not have enough memory to complete the task. This often happens while running flagdata and does not always halt execution of the pipeline. If you are not using the maximum amount of memory per node, increase your allocation (up to 232 GB on an ilifu node from the Main partition, or 480 GB on an ilifu node from the HighMem partition). If you are using the maximum amount of memory per node, reduce the number of tasks per node and increase the number of nodes in the config file (e.g. halve the tasks and double the nodes) before re-launching the pipeline, as this allocates more memory to each task.
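As a minimal sketch of that change, assuming the [slurm] section of your config file uses keys named nodes and ntasks_per_node (check your own config for the exact key names and current values):
# Before (illustrative values only)
nodes = 4
ntasks_per_node = 16
# After: halve the tasks per node and double the nodes, doubling the memory available to each task
nodes = 8
ntasks_per_node = 8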
summary.sh will show a State of OUT_OF_ME+ for jobs that have run out of memory, and the following error should appear in the .err logs (e.g. for job ID 1234567):
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=1234567.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: compute-070: task 0: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=1234567.0
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=1234567.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
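To confirm from SLURM's accounting records that the job was killed for exceeding its memory request, something like the following can be used (a sketch; replace 1234567 with the failed job ID):
sacct -j 1234567 --format=JobID,JobName,State,ReqMem,MaxRSS,Elapsed   # State should show OUT_OF_MEMORY; MaxRSS is the peak memory used per step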
There are cases where a failure in flagdata can leave the MS in an intermediate state that causes the subsequent calibration tasks to fail. If the pipeline run did not cancel itself, we recommend killing any currently running jobs (by running ./killJobs.sh from the parent directory), wiping the *MHz subdirectories, and re-running processMeerKAT.py -R [-C <config_file>] and ./submit_pipeline.sh after making the above changes to the config file. Alternatively, you could attempt to resume the pipeline.
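A rough sketch of that recovery sequence, run from the parent directory after editing the config (the config filename below is a placeholder; adapt it to your own run):
./killJobs.sh                          # cancel any jobs still running
rm -r *MHz                             # wipe the per-SPW subdirectories (check first that nothing you need is inside)
processMeerKAT.py -R -C myconfig.txt   # regenerate the run with the updated config
./submit_pipeline.sh                   # relaunch the pipeline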
False positives
Server timeout errors
Errors of the form
MPIMonitorClient::get_server_timeout::MPIMonitorClient::get_server_timeout::@slwrk-155:MPIClient Found 1 servers in timeout status
are benign and do not have an impact on the pipeline performance. The MPI daemon simply times out waiting for a worker node to respond, and prints out this error to the logs. SLURM is able to handle these timeouts gracefully and restarts the process once it times out.
Empty rows in sub-MSs
*** Error *** Error in data selection specification: MSSelectionNullSelection : The selected table has zero rows.
Some tasks might complain that no valid data were found in a sub-MS, due to some combination of data selection parameters resulting in a null selection for a specific sub-MS. Generally this seems to be a “harmless” error, and doesn’t seem to affect the progress of the calibration/pipeline.
“No valid SPW and Chan combination found”
Errors of the form
agentflagger::::MPIServer-31 (file ../../tools/flagging/agentflagger_cmpt.cc, line 35) Exception Reported: No valid SPW & Chan combination found
often show up in the logs of either flag_round_1 or flag_round_2, or both. Similar to the applycal error, this is caused by a combination of data selection parameters leading to a null selection for this particular sub-MS. It simply means that the requested flagging range lies outside the frequency range of the target MS/sub-MS.
UTC offset by more than 1s
Errors of the form
Leap second table TAI_UTC seems out-of-date. Until the table is updated (see the CASA documentation or your system admin), times and coordinates derived from UTC could be wrong by 1s or more.
are fairly common and are completely benign. These errors are auto-generated by CASA when the internal data repository has not been updated for some pre-defined length of time. While we try to keep our containers up to date, these errors can still occur and have no impact on the image quality or fidelity.
Generic applycal error
Sometimes the findErrors.sh script will report generic errors in the applycal task that look something like
SEVERE applycal::::@slwrk-128::MPIServer-7 An error occurred running task applycal
There are two ways to determine if this is a “false positive”. First, locate the script in question inside the logs directory of the relevant SPW directory and inspect the .casa log; it should contain several successful applycal instances. Second, the corresponding .out or .err file should contain lines that read
*** Error *** Error in data selection specification: MSSelectionNullSelection : The selected table has zero rows.
These errors arise because a combination of chan/SPW/field/time selection has resulted in a null selection for one sub-MS inside the MMS, causing applycal to fail for that one sub-MS. They have no impact on the final image quality.
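A quick way to check for both signatures from within the relevant SPW directory is something like the following (assuming the log-file naming used above):
grep -c applycal logs/*.casa                              # count lines mentioning applycal in each CASA log
grep -l MSSelectionNullSelection logs/*.out logs/*.err    # list the .out/.err files containing the null-selection message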