Job Manager is not responding
The job manager, SLURM, is not responding to requests to start new jobs. When a job request is submitted, SLURM responds with a job number, but the job never starts and never shows up.
1. Check logs
[2022-05-17T14:29:12.809] error: Unable to resolve "NODE_HOSTNAME_1": Unknown host
[2022-05-17T14:29:12.809] error: _set_slurmd_addr: failure on NODE_HOSTNAME_1
[2022-05-17T14:29:13.591] error: Unable to resolve "NODE_HOSTNAME_2": Unknown host
[2022-05-17T14:29:13.591] error: _set_slurmd_addr: failure on NODE_HOSTNAME_2
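To collect the affected host names from a long log, the error lines above can be filtered down to a unique list. This is a minimal sketch: unresolved_hosts is a hypothetical helper name, it reads log text on standard input, and it assumes the message format shown above.

```shell
# Hypothetical helper: print each host name that the log reports as
# unresolvable, one per line, without duplicates.
# Assumes the 'Unable to resolve "HOST": Unknown host' format shown above.
unresolved_hosts() {
  sed -n 's/.*error: Unable to resolve "\([^"]*\)": Unknown host.*/\1/p' | sort -u
}
```

For example: unresolved_hosts < LOGFILE, where LOGFILE is the slurmctld log. Its location is site-specific; it can usually be found with: scontrol show config | grep SlurmctldLogFile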
2. Check /etc/hosts to see whether the host names reported in the log errors are listed there
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.10.0.1 graphical_node
172.10.0.2 NODE_HOSTNAME_3
172.10.0.3 NODE_HOSTNAME_4
172.10.0.4 NODE_HOSTNAME_5
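The lookup in this step can also be scripted instead of read by eye. The sketch below is an assumption-laden example rather than a standard tool: host_in_hosts_file is a hypothetical helper that checks whether a name appears as a host name or alias in a hosts file, ignoring comments.

```shell
# Hypothetical helper: succeed if host name $1 appears as a name or alias
# in hosts file $2 (defaults to /etc/hosts). The address column is skipped.
host_in_hosts_file() {
  awk -v name="$1" '
    { sub(/#.*/, "") }    # strip comments before comparing fields
    { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
    END { exit !found }   # exit status 0 only when the name was seen
  ' "${2:-/etc/hosts}"
}

# Example use with the host names from the log errors:
# for h in NODE_HOSTNAME_1 NODE_HOSTNAME_2; do
#   host_in_hosts_file "$h" || echo "$h is missing from /etc/hosts"
# done
```

To test resolution through every source the system consults, not only /etc/hosts, getent hosts NAME performs the lookup through the system resolver.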
3. Check nodes with:
sinfo
It should return output similar to:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug*    up    infinite  5     idle~ NODE_HOSTNAMES[1]
debug*    up    infinite  10    alloc NODE_HOSTNAMES[2]
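On a cluster with many partitions, the sinfo output can be condensed into a node count per state. This is only a sketch: node_state_totals is a hypothetical helper, and it assumes the default six-column layout shown above.

```shell
# Hypothetical helper: read default `sinfo` output on standard input and
# print the total node count per state, sorted by state name.
# Assumes NODES is column 4 and STATE is column 5, as in the sample above.
node_state_totals() {
  awk 'NR > 1 { total[$5] += $4 } END { for (s in total) print s, total[s] }' | sort
}
```

For example: sinfo | node_state_totals. A ~ suffix on a state marks nodes that are powered down for power saving. If nodes show as down or drained, sinfo -R lists the reason recorded for each.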
4. Check whether the slurmctld service is up and running
sudo systemctl status slurmctld
5. If the service is not running properly, restart it, or stop it and then start it again
sudo systemctl restart slurmctld
# OR
sudo systemctl stop slurmctld
sudo systemctl start slurmctld
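Steps 4 and 5 can be combined so that the restart only happens when systemd reports the service as not active. This is a minimal sketch with a hypothetical helper name, restart_if_inactive. Note that a service can be reported as active and still be unresponsive, in which case the manual restart above still applies.

```shell
# Hypothetical helper: restart service $1 only when systemd does not
# report it as active; otherwise leave it alone.
restart_if_inactive() {
  if systemctl is-active --quiet "$1"; then
    echo "$1 is active, no restart needed"
  else
    echo "$1 is not active, restarting"
    sudo systemctl restart "$1"
  fi
}
```

For example: restart_if_inactive slurmctld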
6. Check that the service is running and that the nodes are up and processing jobs
# Check service:
sudo systemctl status slurmctld
# Check nodes:
sinfo
# Check jobs:
squeue
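To confirm that one specific job is now visible to the scheduler, squeue can be asked about that job ID directly. The sketch below uses a hypothetical helper, job_in_queue; the job ID is the number SLURM returned when the job was submitted.

```shell
# Hypothetical helper: succeed if job ID $1 is listed by squeue.
# -h suppresses the header line, -j selects the job by ID.
job_in_queue() {
  [ -n "$(squeue -h -j "$1" 2>/dev/null)" ]
}
```

For example: job_in_queue JOBID && echo "job is pending or running". A job that has already finished is no longer listed by squeue, so a failed check does not by itself mean the job was lost.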