On-Premises Integration

The Slurm cluster can also be configured to manage on-premises compute nodes. You must configure the on-premises compute nodes yourself and then provide their configuration information to the cluster.

Network Requirements

The on-premises network must have a CIDR range that does not overlap the Slurm cluster's VPC, and the two networks must be connected using a VPN or AWS Direct Connect. The on-premises firewall must allow ingress from and egress to the VPC. The required ports are used to connect to the file systems and the Slurm controllers, and to allow traffic between virtual desktops and compute nodes.
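
As an illustrative sketch only, assuming the default Slurm ports (6817 for slurmctld, 6818 for slurmd, 6819 for slurmdbd) and NFS over TCP port 2049, the on-premises firewall rules might look like the following firewalld commands. The Slurm VPC CIDR shown is a placeholder; check your cluster's slurm.conf and file systems for the ports actually in use.

# Sketch only: allow the default Slurm and NFS ports from the Slurm VPC.
# 10.2.0.0/16 is a placeholder for the Slurm VPC's CIDR; adjust the ports to match
# SlurmctldPort/SlurmdPort in slurm.conf and your file system protocols.
firewall-cmd --permanent --new-zone=slurmvpc
firewall-cmd --permanent --zone=slurmvpc --add-source=10.2.0.0/16
firewall-cmd --permanent --zone=slurmvpc --add-port=6817-6819/tcp   # slurmctld, slurmd, slurmdbd
firewall-cmd --permanent --zone=slurmvpc --add-port=2049/tcp        # NFS
firewall-cmd --reload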

DNS Requirements

The on-premises DNS must either have an entry for the Slurm controller or have a forwarding rule that sends queries for the cluster's domain to the AWS-provided DNS in the Slurm VPC.
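
As a quick sanity check from an on-premises node, confirm that the controller's name resolves, either through your local DNS or through the forwarder (for example, a Route 53 Resolver inbound endpoint). The hostname and endpoint IP below are placeholders.

# Placeholder name: replace with your Slurm controller's hostname and domain.
dig +short slurmctl1.hpc.example.com

# If forwarding the cluster's domain to the AWS-provided DNS, query the
# Route 53 Resolver inbound endpoint directly to verify the forwarding target.
dig +short slurmctl1.hpc.example.com @10.2.0.10   # placeholder inbound endpoint IP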

File System Requirements

All of the compute nodes in the cluster, including the on-premises nodes, must have file system mounts that replicate the same directory structure. This can involve mounting file systems across the VPN or Direct Connect connection, or synchronizing file systems with tools such as rsync, NetApp FlexCache, or SnapMirror. Performance requirements will dictate the file system architecture.

The on-premises compute nodes must mount the Slurm controller's NFS export so that they have access to the Slurm binaries and configuration file. They must then be configured to run slurmd so that they can be managed by Slurm.
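
For illustration only, assuming the controller exports /opt/slurm over NFS and a shared /home file system is reachable across the VPN or Direct Connect link, the on-premises /etc/fstab entries might look like the following. The hostnames, cluster name, and paths are placeholders and must match your cluster's actual exports and mount points.

# Placeholder hosts and paths: the Slurm export and a shared file system mounted
# at the same paths used by the cloud compute nodes.
slurmctl1.hpc.example.com:/opt/slurm/examplecluster  /opt/slurm/examplecluster  nfs  defaults,_netdev  0 0
fs-0123456789abcdef0.efs.us-east-1.amazonaws.com:/   /home                      nfs  defaults,_netdev  0 0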

Slurm Configuration of On-Premises Compute Nodes

The Slurm cluster's configuration file supports on-premises compute nodes. The cluster will not provision the on-premises nodes, network, or firewall, but it will configure its own resources so that the on-premises nodes can use them. All that needs to be specified is the Slurm configuration file for the on-premises nodes and their CIDR block.

  InstanceConfig:
    OnPremComputeNodes:
      ConfigFile: 'slurm_nodes_on_prem.conf'
      CIDR: '10.1.0.0/16'

slurm_nodes_on_prem.conf

#
# ON PREMISES COMPUTE NODES
#
# Config file with list of statically provisioned on-premises compute nodes that
# are managed by this cluster.
#
# These nodes must be addressable on the network and firewalls must allow access on all ports
# required by slurm.
#
# The compute nodes must have mounts that mirror the compute cluster including mounting the slurm file system
# or a mirror of it.
NodeName=Default State=DOWN

NodeName=onprem-c7-x86-t3-2xl-0 NodeAddr=onprem-c7-x86-t3-2xl-0.onprem.com  CPUs=4  RealMemory=30512   Feature=c7,CentOS_7_x86_64,x86_64,GHz:2.5 Weight=1
NodeName=onprem-c7-x86-t3-2xl-1 NodeAddr=onprem-c7-x86-t3-2xl-1.onprem.com  CPUs=4  RealMemory=30512   Feature=c7,CentOS_7_x86_64,x86_64,GHz:2.5 Weight=1
NodeName=onprem-c7-x86-t3-2xl-2 NodeAddr=onprem-c7-x86-t3-2xl-2.onprem.com  CPUs=4  RealMemory=30512   Feature=c7,CentOS_7_x86_64,x86_64,GHz:2.5 Weight=1
NodeName=onprem-c7-x86-t3-2xl-3 NodeAddr=onprem-c7-x86-t3-2xl-3.onprem.com  CPUs=4  RealMemory=30512   Feature=c7,CentOS_7_x86_64,x86_64,GHz:2.5 Weight=1
NodeName=onprem-c7-x86-t3-2xl-4 NodeAddr=onprem-c7-x86-t3-2xl-4.onprem.com  CPUs=4  RealMemory=30512   Feature=c7,CentOS_7_x86_64,x86_64,GHz:2.5 Weight=1
NodeName=onprem-c7-x86-t3-2xl-5 NodeAddr=onprem-c7-x86-t3-2xl-5.onprem.com  CPUs=4  RealMemory=30512   Feature=c7,CentOS_7_x86_64,x86_64,GHz:2.5 Weight=1
NodeName=onprem-c7-x86-t3-2xl-6 NodeAddr=onprem-c7-x86-t3-2xl-6.onprem.com  CPUs=4  RealMemory=30512   Feature=c7,CentOS_7_x86_64,x86_64,GHz:2.5 Weight=1
NodeName=onprem-c7-x86-t3-2xl-7 NodeAddr=onprem-c7-x86-t3-2xl-7.onprem.com  CPUs=4  RealMemory=30512   Feature=c7,CentOS_7_x86_64,x86_64,GHz:2.5 Weight=1
NodeName=onprem-c7-x86-t3-2xl-8 NodeAddr=onprem-c7-x86-t3-2xl-8.onprem.com  CPUs=4  RealMemory=30512   Feature=c7,CentOS_7_x86_64,x86_64,GHz:2.5 Weight=1
NodeName=onprem-c7-x86-t3-2xl-9 NodeAddr=onprem-c7-x86-t3-2xl-9.onprem.com  CPUs=4  RealMemory=30512   Feature=c7,CentOS_7_x86_64,x86_64,GHz:2.5 Weight=1

#
#
# OnPrem Partition
#
# This is the default partition and includes all nodes for the 1st OS.
#
PartitionName=onprem Default=YES PriorityTier=20000 Nodes=\
onprem-c7-x86-t3-2xl-[0-9]

#
# Always on partitions
#
SuspendExcParts=onprem

Simulating an On-Premises Network Using AWS

Create a new VPC with public and private subnets and NAT gateways. To simulate the latency between an AWS region and an on-premises data center, create the VPC in a different region of your account. The CIDR must not overlap with the Slurm VPC's CIDR.
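
As a sketch with the AWS CLI (the region, CIDR, and name are examples; a CloudFormation template or the VPC console wizard works just as well):

# Example only: a 10.1.0.0/16 VPC in us-west-2 to stand in for the on-prem network.
aws ec2 create-vpc --region us-west-2 --cidr-block 10.1.0.0/16 \
    --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=onprem-sim}]'
# Public/private subnets, an internet gateway, and NAT gateways are then added with
# create-subnet, create-internet-gateway, create-nat-gateway, and the associated routes.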

Create a VPC peering connection to your Slurm VPC and accept the connection in the Slurm VPC. In the private subnets of each VPC, create routes for the CIDR of the peered VPC that target the VPC peering connection.
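
For example, with the AWS CLI (all resource IDs, regions, and CIDRs below are placeholders):

# Request peering from the simulated on-prem VPC to the Slurm VPC.
aws ec2 create-vpc-peering-connection --region us-west-2 \
    --vpc-id vpc-0aaaaaaaaaaaaaaaa --peer-vpc-id vpc-0bbbbbbbbbbbbbbbb --peer-region us-east-1

# Accept the connection on the Slurm VPC side.
aws ec2 accept-vpc-peering-connection --region us-east-1 \
    --vpc-peering-connection-id pcx-0123456789abcdef0

# In each VPC's private route tables, route the peer's CIDR to the peering connection.
aws ec2 create-route --region us-west-2 --route-table-id rtb-0123456789abcdef0 \
    --destination-cidr-block 10.2.0.0/16 --vpc-peering-connection-id pcx-0123456789abcdef0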

Add the on-premises VPC to the Slurm VPC's Route 53 private hosted zone.

Create a Route 53 private hosted zone for the on-premises compute nodes and associate it with both the on-premises VPC and the Slurm VPC so that the on-premises compute nodes can be resolved from both networks.
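
For example, the two DNS steps above might look like this with the AWS CLI (zone IDs, VPC IDs, regions, and the domain are placeholders):

# Associate the simulated on-prem VPC with the Slurm cluster's private hosted zone.
aws route53 associate-vpc-with-hosted-zone --hosted-zone-id Z0123456789SLURM \
    --vpc VPCRegion=us-west-2,VPCId=vpc-0aaaaaaaaaaaaaaaa

# Create a private hosted zone for the on-prem nodes, then associate the Slurm VPC with it.
aws route53 create-hosted-zone --name onprem.com --caller-reference onprem-$(date +%s) \
    --vpc VPCRegion=us-west-2,VPCId=vpc-0aaaaaaaaaaaaaaaa
aws route53 associate-vpc-with-hosted-zone --hosted-zone-id Z0123456789ONPREM \
    --vpc VPCRegion=us-east-1,VPCId=vpc-0bbbbbbbbbbbbbbbb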

Copy the Slurm AMIs to the region of the on-premises VPC and create an instance from the copied AMI. Connect to the instance and confirm that the mount points mounted correctly. You will probably have to change the file systems' DNS names to IP addresses. Consider creating A records for the file systems in the Route 53 zone so that if the IP addresses ever change you can update them in one place without creating a new AMI or updating instances. Then create a new AMI from the instance.
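
For example (AMI IDs, instance IDs, and regions are placeholders):

# Copy the compute node AMI into the region used for the simulated on-prem VPC.
aws ec2 copy-image --source-region us-east-1 --source-image-id ami-0123456789abcdef0 \
    --region us-west-2 --name slurm-compute-node-onprem-sim

# After fixing the mounts on a test instance, bake a new AMI from it.
aws ec2 create-image --region us-west-2 --instance-id i-0123456789abcdef0 \
    --name slurm-compute-node-onprem-sim-v2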

Create compute node instances from the new AMI and run the following commands on them to get the slurmd daemon running so that they can join the Slurm cluster.

# Instance specific variables
hostname=onprem-c7-x86-t3-2xl-0

# Domain specific variables
onprem_domain=onprem.com

# instance_vars.sh provides cluster variables such as ClusterName and SLURM_ROOT used below
source /etc/profile.d/instance_vars.sh

# munge needs to be running before calling scontrol
/usr/bin/cp /opt/slurm/$ClusterName/config/munge.key /etc/munge/munge.key
# Ensure the key has the ownership and permissions that munged requires
chown munge:munge /etc/munge/munge.key
chmod 0400 /etc/munge/munge.key
systemctl enable munge
systemctl start munge

# Register this node's IP address with the Slurm controller (use the first address if several)
ipaddress=$(hostname -I | awk '{print $1}')
$SLURM_ROOT/bin/scontrol update nodename=${hostname} nodeaddr=$ipaddress

# Set hostname
hostname_fqdn=${hostname}.${onprem_domain}
if [ "$(hostname)" != "$hostname_fqdn" ]; then
    hostnamectl --static set-hostname $hostname_fqdn
    hostnamectl --pretty set-hostname $hostname
fi

# Create local users and groups to match the cluster, if the config snapshot is available
if [ -e /opt/slurm/${ClusterName}/config/users_groups.json ] && [ -e /opt/slurm/${ClusterName}/bin/create_users_groups.py ]; then
    /opt/slurm/${ClusterName}/bin/create_users_groups.py -i /opt/slurm/${ClusterName}/config/users_groups.json
fi

# Create directory for slurmd.log
logs_dir=/opt/slurm/${ClusterName}/logs/nodes/${hostname}
if [[ ! -d $logs_dir ]]; then
    mkdir -p $logs_dir
fi
# Replace /var/log/slurm with a symlink to the shared logs directory
if [[ -e /var/log/slurm ]]; then
    rm -rf /var/log/slurm
fi
ln -s $logs_dir /var/log/slurm

systemctl enable slurmd
systemctl start slurmd
# Restart so that log file goes to file system
systemctl restart spot_monitor