The job manager, SLURM, is not responding to requests to start new jobs. Submitting a job request returns a job number, but the job never starts and never shows up.
1. Check the logs:
[2022-05-17T14:29:12.809] error: Unable to resolve "NODE_HOSTNAME_1": Unknown host
[2022-05-17T14:29:12.809] error: _set_slurmd_addr: failure on NODE_HOSTNAME_1
[2022-05-17T14:29:13.591] error: Unable to resolve "NODE_HOSTNAME_2": Unknown host
[2022-05-17T14:29:13.591] error: _set_slurmd_addr: failure on NODE_HOSTNAME_2
2. Check /etc/hosts to see whether the host names from the log errors are listed there.
If the nodes are missing from the file, a possible cause is that the cluster is hosted on an Azure VMSS and its nodes restarted and came back with different names.
3. Check nodes with:
sinfo
This should return output similar to:
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
debug*     up     infinite   5      idle~  NODE_HOSTNAMES[1]
debug*     up     infinite   10     alloc  NODE_HOSTNAMES[2]
4. Check whether slurmctld is up and running:
sudo systemctl status slurmctld
5. If the service is not running properly, restart it, or stop it and then start it again.
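
The log check and the /etc/hosts check above can be combined into a small helper. The following is a minimal sketch, not part of the original procedure. The function names (unresolved_hosts, in_hosts_file, missing_hosts) are hypothetical, and the log file path must be supplied by the caller because this note does not say where the slurmctld log lives. It assumes the log lines have the same form as the errors shown above.

```shell
#!/usr/bin/env bash
# Sketch: list host names that slurmctld could not resolve and report
# which of them have no entry in a hosts file.
# Assumption: the log path is passed in by the caller; it is not stated
# in this document.

# Print each unique host name from 'Unable to resolve "<name>"' log lines.
unresolved_hosts() {
    sed -n 's/.*error: Unable to resolve "\([^"]*\)".*/\1/p' "$1" | sort -u
}

# Succeed if the host name appears as a whole field after the address
# column of a hosts-format file. Text after '#' is ignored.
in_hosts_file() {
    awk -v h="$1" '
        { sub(/#.*/, "") }
        { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
        END { exit !found }
    ' "$2"
}

# Print the unresolved host names that are missing from the hosts file
# (second argument, defaulting to /etc/hosts).
missing_hosts() {
    local log_file=$1 hosts_file=${2:-/etc/hosts} host
    unresolved_hosts "$log_file" | while IFS= read -r host; do
        in_hosts_file "$host" "$hosts_file" || printf '%s\n' "$host"
    done
}
```

After sourcing the file, calling missing_hosts with the path to the slurmctld log prints one host name per line for every node that the controller failed to resolve and that is absent from /etc/hosts. Empty output means every host named in the log errors has an entry in the file.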