Debug

For ParallelCluster and Slurm issues, refer to the official AWS ParallelCluster Troubleshooting documentation.

Slurm Head Node

If Slurm commands hang, it's likely a problem with the Slurm controller (slurmctld).

Connect to the head node from the EC2 console using SSM Session Manager or ssh, and switch to the root user.

sudo su
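
You can also open the same session from the AWS CLI, assuming the Session Manager plugin is installed; the instance ID below is a placeholder:

aws ssm start-session --target <head-node-instance-id>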

First, check that the Slurm controller daemon (slurmctld) is running:

systemctl status slurmctld
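
If the service is stopped or failing, the recent daemon output captured by systemd can also help (a quick check, assuming the instance uses systemd-journald):

journalctl -u slurmctld --no-pager -n 50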

If it isn't, first check for errors in the user data script. The following command shows the cloud-init output:

grep cloud-init /var/log/messages | less
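
The same output is also written to cloud-init's own log files; these paths exist on most AMIs (an assumption about the image):

less /var/log/cloud-init-output.log
less /var/log/cloud-init.log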

Then check the controller's logfile.

/var/log/slurmctld.log
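
A quick way to surface recent problems in it, for example:

grep -iE 'error|fail' /var/log/slurmctld.log | tail -20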

The following command, run as root, will rerun the user data script.

/var/lib/cloud/instance/scripts/part-001

Another way to debug slurmctld is to run it interactively with a high debug level. First, get the path to the slurmctld binary.

slurmctld=$(awk -F '=' '/ExecStart/ {print $2}' /etc/systemd/system/slurmctld.service)

Then you can run slurmctld:

$slurmctld -D -vvvvv
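
If the systemd-managed slurmctld is still active, stop it first so the two daemons don't conflict, and start it again when you are done. A minimal sequence:

# stop the managed service before the interactive run
systemctl stop slurmctld
# ... run $slurmctld -D -vvvvv, then Ctrl-C when finished ...
systemctl start slurmctld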

Compute Nodes

If there are problems with the compute nodes, connect to them using SSM Session Manager.
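
If you are unsure which EC2 instances back the compute nodes, the ParallelCluster 3 CLI can list them (the cluster name is a placeholder):

pcluster describe-cluster-instances --cluster-name <cluster-name> --node-type ComputeNode

Then open a session with aws ssm start-session --target <instance-id> as described above.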

Check for cloud-init errors the same way as on the head node. The compute nodes do not run Ansible at boot; their AMIs are pre-configured with Ansible when the image is built.

Also check /var/log/slurmd.log.

Check that the slurmd daemon is running:

systemctl status slurmd
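
If slurmd is not running, it can also be started interactively in the foreground with verbose output, the same approach used for slurmctld above. Run it as root; the binary may not be on PATH, so check the slurmd systemd unit for its location as was done for slurmctld:

slurmd -D -vvv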

Log Files

Logfile                 Description
/var/log/slurmctld.log  slurmctld logfile (head node)
/var/log/slurmd.log     slurmd logfile (compute nodes)

Job Stuck in Pending State

You can use scontrol to get detailed information about a job.

scontrol show job <jobid>
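
The Reason field in the output explains why the job has not started. A more compact view of the same information (the job ID is a placeholder):

squeue -j <jobid> -o "%i %P %j %T %R"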

Job Stuck in Completing State

When a node starts, it reports its number of cores and free memory to the controller. If the reported memory is less than the value in slurm_nodes.conf, the controller will mark the node as invalid. You can confirm this by searching for the node in /var/log/slurmctld.log on the controller. If this happens, fix the memory in slurm_nodes.conf and restart slurmctld.
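
To confirm the mismatch, search the controller log for the node and compare what the node actually detects with the configured value (NODENAME is a placeholder; slurmd -C is run on the compute node itself):

grep NODENAME /var/log/slurmctld.log | tail
scontrol show node NODENAME
slurmd -C

After correcting the memory value, restart the controller: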

systemctl restart slurmctld

Then reboot the compute node.
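
One way to do this from the head node, assuming a static node that should stay in the cluster, is Slurm's own reboot support:

scontrol reboot NODENAME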

Another cause is a hung process on the compute node. To clear it, connect to the Slurm controller and mark the node down, then resume, and then idle.

scontrol update node NODENAME state=DOWN reason=hung
scontrol update node NODENAME state=RESUME
scontrol update node NODENAME state=IDLE
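
Afterwards, check that the node has returned to service (NODENAME is a placeholder):

sinfo -n NODENAME
scontrol show node NODENAME | grep -i state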