Configuration File Format

This project creates a ParallelCluster configuration file that is documented in the ParallelCluster User Guide.

termination_protection: bool
StackName: str
Region: str
SshKeyPair: str
VpcId: str
CIDR: str
SubnetId: str
ErrorSnsTopicArn: str
TimeZone: str
AdditionalSecurityGroupsStackName: str
RESStackName: str
ExternalLoginNodes:
    - Tags:
          - Key: str
            Values: [ str ]
      SecurityGroupId: str
DomainJoinedInstance:
    - Tags:
          - Key: str
            Values: [ str ]
      SecurityGroupId: str
slurm:
    ParallelClusterConfig:
        Version: str
        ClusterConfig: dict
        Image:
            Os: str
            CustomAmi: str
        Architecture: str
        ComputeNodeAmi: str
        EnableEfa: bool
        Database:
            DatabaseStackName: str
            FQDN: str
            Port: int
            AdminUserName: str
            AdminPasswordSecretArn: str
            ClientSecurityGroup:
                SecurityGroupName: SecurityGroupId
        Slurmdbd:
            SlurmdbdStackName: str
            Host: str
            Port: str
            ClientSecurityGroup: str
        Dcv:
            Enabled: bool
            Port: int
            AllowedIps: str
        LoginNodes:
            Pools:
            - Name: str
              Count: int
              InstanceType: str
              GracetimePeriod: int
              Image:
                  CustomAmi: str
              Ssh:
                  KeyName: str
              Networking:
                  SubnetIds:
                      - str
                  SecurityGroups:
                      - str
                  AdditionalSecurityGroups:
                      - str
              Iam:
                  InstanceRole: str
                  InstanceProfile: str
                  AdditionalIamPolicies:
                  - Policy: str
    ClusterName: str
    MungeKeySecret: str
    SlurmCtl:
        SlurmdPort: int
        instance_type: str
        volume_size: int
        CloudWatchPeriod: int
        PreemptMode: str
        PreemptType: str
        PreemptExemptTime: str
        SlurmConfOverrides: str
        SlurmrestdUid: int
        AdditionalSecurityGroups:
        - str
        AdditionalIamPolicies:
        - str
        Imds:
            Secured: bool
    InstanceConfig:
        UseOnDemand: bool
        UseSpot: bool
        DisableSimultaneousMultithreading: bool
        Exclude:
            InstanceFamilies:
            - str
            InstanceTypes:
            - str
        Include:
            MaxSizeOnly: bool
            InstanceFamilies:
            - str
            - str:
                UseOnDemand: bool
                UseSpot: bool
                DisableSimultaneousMultithreading: bool
            InstanceTypes:
            - str
            - str:
                UseOnDemand: bool
                UseSpot: bool
                DisableSimultaneousMultithreading: bool
        NodeCounts:
            DefaultMinCount: int
            DefaultMaxCount: int
            ComputeResourceCounts:
                str: # ComputeResourceName
                    MinCount: int
                    MaxCount: int
        AdditionalSecurityGroups:
        - str
        AdditionalIamPolicies:
        - str
        OnPremComputeNodes:
            ConfigFile: str
            CIDR: str
            Partition: str
    SlurmUid: int
    storage:
        ExtraMounts:
        - dest: str
          src: str
          type: str
          options: str
          StorageType: str
          FileSystemId: str
          VolumeId: str
    Licenses:
        LicenseName:
            Count: int
            Server: str
            Port: str
            ServerType:
            StatusScript:

Top Level Config

termination_protection

Enable CloudFormation stack termination protection.

default=True

StackName

The name of the configuration stack that will configure ParallelCluster and deploy it.

If you do not specify the ClusterName then it will default to a value based on the StackName. If StackName ends in -config then ClusterName will be the StackName with -config stripped off. Otherwise it will be the StackName with -cl (for cluster) appended.

Optional so that it can be specified on the command line

default='slurm-config'
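
For example, with the following setting (the stack name is illustrative), the ClusterName would default to res-eda because the -config suffix is stripped off.

StackName: res-eda-config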

Region

AWS region where the cluster will be deployed.

Optional so that it can be specified on the command line

SshKeyPair

Default EC2 key pair that will be used for all cluster instances.

Optional so that it can be specified on the command line

VpcId

The ID of the VPC where the cluster will be deployed.

Optional so that it can be specified on the command line

CIDR

The CIDR of the VPC. This is used in security group rules.

SubnetId

The ID of the VPC subnet where the cluster will be deployed.

Optional. If not specified then the first private subnet is chosen. If no private subnets exist, then the first isolated subnet is chosen. If no isolated subnets exist, then the first public subnet is chosen.

We recommend using a private or isolated subnet.

ErrorSnsTopicArn

The ARN of an existing SNS topic. Errors will be published to the SNS topic. You can subscribe to the topic so that you are notified for things like script or lambda errors.

Optional, but highly recommended

TimeZone

The time zone to use for all EC2 instances in the cluster.

default='US/Central'

AdditionalSecurityGroupsStackName

If you followed the automated process to create security groups for external login nodes and file systems, then specify the stack name that you deployed and the additional security groups will be configured for the head and compute nodes.

RESStackName

If you are deploying the cluster for use from Research and Engineering Studio (RES) virtual desktops, then you can specify the stack name of the RES environment to automate the integration. The virtual desktops will automatically be configured to use the cluster.

This requires you to configure security groups for external login nodes.

The Slurm binaries will be compiled for the OS of the desktops, and an environment modulefile will be created so that users just need to load the cluster modulefile to use the cluster.
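
For example, a sketch of the top-level settings for a RES integration might look like the following; the stack names are illustrative.

RESStackName: res-eda
AdditionalSecurityGroupsStackName: res-eda-slurm-sgs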

ExternalLoginNodes

An array of specifications for instances that should automatically be configured as Slurm login nodes. Each array element contains one or more tags that will be used to select login node instances. It also includes the security group ID that must be attached to the login nodes to give them access to the Slurm cluster. The tags for a group of instances are specified as an array of tag keys, each with an array of values.

A lambda function processes each login node specification. It uses the tags to select running instances. If the instances do not have the security group attached, then it will attach the security group. It will then run a script on each instance to configure it as a login node for the Slurm cluster. To use the cluster, users simply load the environment modulefile that is created by the script.

For example, to configure RES virtual desktops as Slurm login nodes the following configuration is added.

---
ExternalLoginNodes:
- Tags:
  - Key: 'res:EnvironmentName'
    Values: [ 'res-eda' ]
  - Key: 'res:NodeType'
    Values: ['virtual-desktop-dcv-host']
  SecurityGroupId: <SlurmLoginNodeSGId>

DomainJoinedInstance

A specification for a domain-joined instance that will be used to create and update users_groups.json. It also includes the security group ID that must be attached to the instance to give it access to the Slurm head node so that it can mount the Slurm configuration file system. The tags for the instance are specified as an array of tag keys, each with an array of values.

A lambda function processes the specification. It uses the tags to select a running instance. If the instance does not have the security group attached, then it will attach the security group. It will then run a script on the instance that configures it to save all of the users and groups into a JSON file that is used to create local users and groups on compute nodes when they boot.

For example, to configure the RES cluster manager, the following configuration is added.

---
DomainJoinedInstance:
- Tags:
  - Key: 'Name'
    Values: [ 'res-eda-cluster-manager' ]
  - Key: 'res:EnvironmentName'
    Values: [ 'res-eda' ]
  - Key: 'res:ModuleName'
    Values: [ 'cluster-manager' ]
  - Key: 'res:ModuleId'
    Values: [ 'cluster-manager' ]
  - Key: 'app'
    Values: ['virtual-desktop-dcv-host']
  SecurityGroupId: <SlurmLoginNodeSGId>

slurm

Slurm configuration parameters.

ParallelClusterConfig

ParallelCluster specific configuration parameters.

Version

The ParallelCluster version.

This is required and cannot be changed after the cluster is created.

Updating to a new version of ParallelCluster requires either deleting the current cluster or creating a new cluster.

ClusterConfig

type: dict

Additional ParallelCluster configuration settings that will be directly added to the configuration without checking.

This will be used to create the initial ParallelCluster configuration; other settings in this configuration file will override values in the dict.

This exists to enable further customization of ParallelCluster beyond what this configuration supports.

The cluster configuration format is documented in the ParallelCluster User Guide.

For example, if you want to change the ScaledownIdletime, you would add the following to your config file.

slurm:
  ParallelClusterConfig:
    ClusterConfig:
      Scheduling:
        SlurmSettings:
          ScaledownIdletime: 20

Image

The OS and AMI to use for the head node and compute nodes.

OS

See the ParallelCluster docs for the supported OS distributions and versions.

CustomAmi

See the ParallelCluster docs for the custom AMI documentation.

NOTE: A CustomAmi must be provided for Rocky8 or Rocky9. All other distributions have a default AMI that is provided by ParallelCluster.
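
For example, a sketch of a Rocky Linux 8 configuration; the AMI ID is a placeholder.

slurm:
  ParallelClusterConfig:
    Image:
      Os: rocky8
      CustomAmi: ami-0123456789abcdef0  # placeholder AMI ID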

Architecture

The CPU architecture to use for the cluster.

ParallelCluster doesn't support heterogeneous clusters. All of the instances must have the same CPU architecture and the same OS.

The cluster, however, can be accessed from login nodes of any architecture and OS.

Valid Values:

  • arm64
  • x86_64

default: x86_64

ComputeNodeAmi

AMI to use for compute nodes.

All compute nodes will use the same AMI.

The default AMI is selected by the Image parameters.

EnableEfa

type: bool

default: False

We recommend not enabling EFA unless necessary, in order to avoid insufficient capacity errors when starting new instances in a placement group or when there are multiple instance types in the group.

See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html#placement-groups-cluster

Database

Optional

Note: Starting with ParallelCluster 3.10.0, you should use slurm/ParallelClusterConfig/Slurmdbd instead of slurm/ParallelClusterConfig/Database. You cannot have both parameters.

Configure the Slurm database to use with the cluster.

This is created independently of the cluster so that the same database can be used with multiple clusters.

See Create ParallelCluster Slurm Database on the deployment prerequisites page.

If you used the CloudFormation template provided by ParallelCluster, then the easiest way to configure it is to pass the name of the stack in slurm/ParallelClusterConfig/Database/DatabaseStackName. All of the other parameters will be pulled from the outputs of the stack.

See the ParallelCluster documentation.
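
For example, a minimal sketch that pulls all of the database parameters from an existing stack; the stack name is illustrative.

slurm:
  ParallelClusterConfig:
    Database:
      DatabaseStackName: pcluster-slurm-db  # illustrative stack name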

DatabaseStackName

Name of the ParallelCluster CloudFormation stack that created the database.

The following parameters will be set using the outputs of the stack:

  • FQDN
  • Port
  • AdminUserName
  • AdminPasswordSecretArn
  • ClientSecurityGroup

FQDN

Used with the Port to set the URI of the database.

Database: Port

type: int

Database's port.

AdminUserName

type: str

The identity that Slurm uses to connect to the database, write accounting logs, and perform queries. The user must have both read and write permissions on the database.

Sets the UserName parameter in ParallelCluster.

AdminPasswordSecretArn

type: str

The Amazon Resource Name (ARN) of the AWS Secrets Manager secret that contains the AdminUserName plaintext password. This password is used together with AdminUserName and Slurm accounting to authenticate on the database server.

Sets the PasswordSecretArn parameter in ParallelCluster.

Database: ClientSecurityGroup

Security group that has permissions to connect to the database.

Required to be attached to the head node that is running slurmdbd so that the port connection to the database is allowed.

Slurmdbd

Note: This is not supported before ParallelCluster 3.10.0. If you specify this parameter then you cannot specify slurm/ParallelClusterConfig/Database.

Optional

Configure an external Slurmdbd instance to use with the cluster. The Slurmdbd instance provides access to the shared Slurm database.

This is created independently of the cluster so that the same Slurmdbd instance and database can be used with multiple clusters.

See Create Slurmdbd instance on the deployment prerequisites page.

If you used the CloudFormation template provided by ParallelCluster, then the easiest way to configure it is to pass the name of the stack in slurm/ParallelClusterConfig/Slurmdbd/SlurmdbdStackName. All of the other parameters will be pulled from the parameters and outputs of the stack.

See the ParallelCluster documentation for ExternalSlurmdbd.
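
For example, a minimal sketch that pulls all of the Slurmdbd parameters from an existing stack; the stack name is illustrative.

slurm:
  ParallelClusterConfig:
    Slurmdbd:
      SlurmdbdStackName: pcluster-slurmdbd  # illustrative stack name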

SlurmdbdStackName

Name of the ParallelCluster CloudFormation stack that created the Slurmdbd instance.

The following parameters will be set using the outputs of the stack:

  • Host
  • Port
  • ClientSecurityGroup

Slurmdbd: Host

IP address or DNS name of the Slurmdbd instance.

Slurmdbd: Port

Default: 6819

Port used by the slurmdbd daemon on the Slurmdbd instance.

Slurmdbd: ClientSecurityGroup

Security group that has access to use the Slurmdbd instance. This will be added as an extra security group to the head node.

ClusterName

Name of the ParallelCluster cluster.

Default: If StackName ends with "-config" then ClusterName is StackName with "-config" stripped off. Otherwise, ClusterName is StackName with "-cl" appended.

MungeKeySecret

AWS secret with a base64-encoded munge key to use for the cluster. For an existing secret, this can be the secret name or the ARN. If the secret doesn't exist, one will be created, but it won't be part of the CloudFormation stack so that it won't be deleted when the stack is deleted. Required if your login nodes need to use more than one cluster.

See Create Munge Key for more details.
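
For example, a sketch with an illustrative secret name; an ARN also works.

slurm:
  MungeKeySecret: slurm-munge-key  # illustrative secret name or ARN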

SlurmCtl

Configure the Slurm head node or controller.

Required, but can be an empty dict to accept all of the defaults.
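
For example, the minimal configuration accepts all of the defaults.

slurm:
  SlurmCtl: {}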

SlurmdPort

Port used for the slurmd daemon on the compute nodes.

default=6818

type: int

instance_type

Instance type of the head node.

Must match the architecture of the cluster.

volume_size

The size of the EBS root volume on the head node in GB.

default=200

type: int

CloudWatchPeriod

The frequency of CloudWatch metrics in seconds.

default=5

type: int

PreemptMode

Set job preemption policy for the cluster.

Jobs can be set to be preemptible when they are submitted. This allows higher priority jobs to preempt a running job when resources are constrained. This policy sets what happens to the preempted jobs.

Slurm documentation

Valid values:

  • 'OFF'
  • 'CANCEL'
  • 'GANG'
  • 'REQUEUE'
  • 'SUSPEND'

default='REQUEUE'

PreemptType

Slurm documentation

Valid values:

  • 'preempt/none'
  • 'preempt/partition_prio'
  • 'preempt/qos'

default='preempt/partition_prio'

PreemptExemptTime

Slurm documentation

Global option for minimum run time for all jobs before they can be considered for preemption.

A time of -1 disables the option, equivalent to 0. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes", and "days-hours:minutes:seconds".

default='0'

type: str
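
For example, a sketch that requires jobs to run for at least one hour (hours:minutes:seconds format) before they can be preempted.

slurm:
  SlurmCtl:
    PreemptExemptTime: '1:00:00'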

SlurmConfOverrides

File that will be included at end of slurm.conf to override configuration parameters.

This allows you to customize the slurm configuration arbitrarily.

This should be used with caution since it can result in errors that make the cluster non-functional.

type: str

SlurmrestdUid

User ID for the slurmrestd daemon.

type: int

default=901

SlurmRestApiVersion

The REST API version.

This is automatically set based on the Slurm version being used by the ParallelCluster version.

type: str

default: '0.0.39'

Head Node AdditionalSecurityGroups

Additional security groups that will be added to the head node instance.

Head Node AdditionalIamPolicies

List of Amazon Resource Names (ARNs) of IAM policies for Amazon EC2 that will be added to the head node instance.

InstanceConfig

Configure the instances used by the cluster for compute nodes.

ParallelCluster is limited to a total of 50 compute resources and we only put 1 instance type in each compute resource. This limits you to a total of 50 instance types per cluster. If you need more instance types than that, then you will need to create multiple clusters. If you configure both on-demand and spot for each instance type, then the limit is effectively 25 instance types because 2 compute resources will be created for each instance type.

If you configure more than 50 instance types then the installer will fail with an error. You will then need to modify your configuration to either include fewer instance types or exclude instance types from the configuration.

If no Include and Exclude parameters are specified, then default EDA instance types will be configured with both On-Demand and Spot Instances. The defaults will include the latest generation instance families in the c, m, r, x, and u families. Older instance families are excluded. Metal instance types are also excluded. Specific instance types are also excluded to keep the total number of instance types under 50. If multiple instance types have the same amount of memory, then the instance types with the highest core counts are excluded. This is because EDA workloads are typically memory limited, not core limited.

If any Include or Exclude parameters are specified, then minimal defaults will be used for the parameters that aren't specified. By default, all instance families are included and no specific instance types are included. By default, all instance types with less than 4 GiB of memory are excluded because they don't have enough memory for a Slurm compute node.

If no includes or excludes are provided, the defaults are:

slurm:
  InstanceConfig:
    Exclude:
      InstanceFamilies:
      - 'a1'   # Graviton 1
      - 'c4'   # Replaced by c5
      - 'd2'   # SSD optimized
      - 'g3'   # Replaced by g4
      - 'g3s'  # Replaced by g4
      - 'h1'   # SSD optimized
      - 'i3'   # SSD optimized
      - 'i3en' # SSD optimized
      - 'm4'   # Replaced by m5
      - 'p2'   # Replaced by p3
      - 'p3'
      - 'p3dn'
      - 'r4'   # Replaced by r5
      - 't2'   # Replaced by t3
      - 'x1'
      - 'x1e'
      InstanceTypes:
      - '.*\.metal'
      # Reduce the number of selected instance types to 25.
      # Exclude larger core counts for each memory size
      # 2 GB:
      - 'c7a.medium'
      - 'c7g.medium'
      # 4 GB: m7a.medium, m7g.medium
      - 'c7a.large'
      - 'c7g.large'
      # 8 GB: r7a.medium, r7g.medium
      - 'm5zn.large'
      - 'm7a.large'
      - 'm7g.large'
      - 'c7a.xlarge'
      - 'c7g.xlarge'
      # 16 GB: r7a.large, x2gd.medium, r7g.large
      - 'r7iz.large'
      - 'm5zn.xlarge'
      - 'm7a.xlarge'
      - 'm7g.xlarge'
      - 'c7a.2xlarge'
      - 'c7g.2xlarge'
      # 32 GB: r7a.xlarge, x2gd.large, r7g.xlarge
      - 'r7iz.xlarge'
      - 'm5zn.2xlarge'
      - 'm7a.2xlarge'
      - 'm7g.2xlarge'
      - 'c7a.4xlarge'
      - 'c7g.4xlarge'
      # 64 GB: r7a.2xlarge, x2gd.xlarge, r7g.2xlarge
      - 'r7iz.2xlarge'
      - 'm7a.4xlarge'
      - 'm7g.4xlarge'
      - 'c7a.8xlarge'
      - 'c7g.8xlarge'
      # 96 GB:
      - 'm5zn.6xlarge'
      - 'c7a.12xlarge'
      - 'c7g.12xlarge'
      # 128 GB: x2iedn.xlarge, r7iz.4xlarge, x2gd.2xlarge, r7g.4xlarge
      - 'r7a.4xlarge'
      - 'm7a.8xlarge'
      - 'm7g.8xlarge'
      - 'c7a.16xlarge'
      - 'c7g.8xlarge'
      # 192 GB: m5zn.12xlarge, m7a.12xlarge, m7g.12xlarge
      - 'c7a.24xlarge'
      # 256 GB: x2iedn.2xlarge, x2iezn.2xlarge, x2gd.4xlarge, r7g.8xlarge
      - 'r7iz.8xlarge'
      - 'r7a.8xlarge'
      - 'm7a.16xlarge'
      - 'm7g.16xlarge'
      - 'c7a.32xlarge'
      # 384 GB: r7iz.12xlarge, r7g.12xlarge
      - 'r7a.12xlarge'
      - 'm7a.24xlarge'
      - 'c7a.48xlarge'
      # 512 GB: x2iedn.4xlarge, x2iezn.4xlarge, x2gd.8xlarge, r7g.16xlarge
      - 'r7iz.16xlarge'
      - 'r7a.16xlarge'
      - 'm7a.32xlarge'
      # 768 GB: r7a.24xlarge, x2gd.12xlarge
      - 'x2iezn.6xlarge'
      - 'm7a.48xlarge'
      # 1024 GB: x2iedn.8xlarge, x2iezn.8xlarge, x2gd.16xlarge
      - 'r7iz.32xlarge'
      - 'r7a.32xlarge'
      # 1536 GB: x2iezn.12xlarge, x2idn.24xlarge
      - 'r7a.48xlarge'
      # 2048 GB: x2iedn.16xlarge
      - 'x2idn.32xlarge'
      # 3072 GB: x2iedn.24xlarge
      # 4096 GB: x2iedn.32xlarge
    Include:
      InstanceFamilies:
      - 'c7a'               # AMD EPYC 9R14 Processor 3.7 GHz
      - 'c7g'               # AWS Graviton3 Processor 2.6 GHz
      - 'm5zn'              # Intel Xeon Platinum 8252 4.5 GHz
      - 'm7a'               # AMD EPYC 9R14 Processor 3.7 GHz
      - 'm7g'               # AWS Graviton3 Processor 2.6 GHz
      - 'r7a'               # AMD EPYC 9R14 Processor 3.7 GHz
      - 'r7g'               # AWS Graviton3 Processor 2.6 GHz
      - 'r7iz'              # Intel Xeon Scalable (Sapphire Rapids) 3.2 GHz
      - 'x2gd'              # AWS Graviton2 Processor 2.5 GHz 1TB
      - 'x2idn'             # Intel Xeon Scalable (Icelake) 3.5 GHz 2 TB
      - 'x2iedn'            # Intel Xeon Scalable (Icelake) 3.5 GHz 4 TB
      - 'x2iezn'            # Intel Xeon Platinum 8252 4.5 GHz 1.5 TB
      - 'u.*'
      InstanceTypes: []

UseOnDemand

Configure on-demand instances. This sets the default for all included instance types. It can be overridden for included instance families and by instance types.

type: bool

default: True

UseSpot

Configure spot instances. This sets the default for all included instance types. It can be overridden for included instance families and by instance types.

type: bool

default: True

DisableSimultaneousMultithreading

type: bool

default=True

Disable SMT on the compute nodes. If true, multithreading on the compute nodes is disabled. This sets the default for all included instance types. It can be overridden for included instance families and by instance types.

Not all instance types can disable multithreading. For a list of instance types that support disabling multithreading, see CPU cores and threads for each CPU core per instance type in the Amazon EC2 User Guide for Linux Instances.

Update policy: The compute fleet must be stopped for this setting to be changed for an update.

ParallelCluster documentation

Exclude

Instance families and types to exclude.

Exclude patterns are processed first and take precedence over any includes.

Instance families and types are regular expressions with implicit '^' and '$' at the beginning and end.

Exclude InstanceFamilies

Regular expressions with implicit '^' and '$' at the beginning and end.

Default: []

Exclude InstanceTypes

Regular expressions with implicit '^' and '$' at the beginning and end.

Default: []

Include

Instance families and types to include.

Exclude patterns are processed first and take precedence over any includes.

Instance families and types are regular expressions with implicit '^' and '$' at the beginning and end.

Each element in the array can be either a regular expression string or a dictionary whose only key is the regular expression string and whose value overrides UseOnDemand, UseSpot, and DisableSimultaneousMultithreading for the matching instance families or instance types.

The settings for instance families override the defaults, and the settings for instance types override the others.

For example, the following configuration defaults to only On-Demand instances with SMT disabled. It includes all of the r7a, r7i, and r7iz instance types. The r7a instances will only have On-Demand instances. The r7i and r7iz instance types will have spot instances except for the r7i.48xlarge which has spot disabled.

This allows you to control these attributes of the compute resources with whatever level of granularity that you need.

slurm:
  InstanceConfig:
    UseOnDemand: true
    UseSpot: false
    DisableSimultaneousMultithreading: true
    Exclude:
      InstanceTypes:
        - .*\.metal
    Include:
      InstanceFamilies:
        - r7a.*
        - r7i.*: {UseSpot: true}
      InstanceTypes:
        - r7i.48xlarge: {UseSpot: false}

MaxSizeOnly

type: bool

default: False

If MaxSizeOnly is True then only the largest instance type in a family will be included unless specific instance types are included.
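
For example, a sketch that only configures the largest instance type of each included family; the family names are illustrative.

slurm:
  InstanceConfig:
    Include:
      MaxSizeOnly: true
      InstanceFamilies:
      - 'r7a'
      - 'r7iz'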

Include InstanceFamilies

Regular expressions with implicit '^' and '$' at the beginning and end.

Default: []

Include InstanceTypes

Regular expressions with implicit '^' and '$' at the beginning and end.

Default: []

NodeCounts

Configure the number of compute nodes of each instance type.

DefaultMinCount

type: int

default: 0

Minimum number of compute nodes to keep running in a compute resource. If the number is greater than zero then static nodes will be created.

DefaultMaxCount

type: int

The maximum number of compute nodes to create in a compute resource.

ComputeResourceCounts

Define compute node counts per compute resource.

These counts will override the defaults set by DefaultMinCount and DefaultMaxCount.

ComputeResourceName

Name of the ParallelCluster compute resource. Can be found using sinfo.
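
For example, a sketch that overrides the default counts for one compute resource; the compute resource name is illustrative, so use sinfo to find the actual names in your cluster.

slurm:
  InstanceConfig:
    NodeCounts:
      DefaultMinCount: 0
      DefaultMaxCount: 10
      ComputeResourceCounts:
        od-r7a-l:         # illustrative compute resource name
          MinCount: 1
          MaxCount: 20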

Compute Resource MinCount

type: int

default: 0

Compute Resource MaxCount

type: int

Compute Node AdditionalSecurityGroups

Additional security groups that will be added to the compute node instances.

Compute Node AdditionalIamPolicies

List of Amazon Resource Names (ARNs) of IAM policies for Amazon EC2 that will be added to the compute node instances.

OnPremComputeNodes

Define on-premises compute nodes that will be managed by the ParallelCluster head node.

The compute nodes must be accessible from the head node over the network and any firewalls must allow all of the Slurm ports between the head node and compute nodes.

ParallelCluster will be configured to allow the necessary network traffic, and the on-premises firewall can be configured to match the ParallelCluster security groups.
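
For example, a sketch of an on-premises node configuration; the file name, CIDR, and partition name are illustrative.

slurm:
  InstanceConfig:
    OnPremComputeNodes:
      ConfigFile: 'slurm_nodes_on_prem.conf'  # illustrative file name
      CIDR: '10.1.0.0/16'
      Partition: onprem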

ConfigFile

Configuration file with the on-premises compute nodes defined in Slurm NodeName format as described in the Slurm slurm.conf documentation.

The file will be included in the ParallelCluster slurm.conf so it can technically include any Slurm configuration updates including custom partition definitions.

NOTE: The syntax of the file isn't checked and syntax errors can result in the slurmctld daemon failing on the head node.

On-Premises CIDR

The CIDR that contains the on-premises compute nodes.

This is to allow egress from the head node to the on-premises nodes.

Partition

A partition that will contain all of the on-premises nodes.

SlurmUid

type: int

default: 900

The user id of the slurm user.

storage

ExtraMounts

Additional mounts for compute nodes.

This can be used so the compute nodes have the same file structure as the remote desktops.

This is used to configure ParallelCluster SharedStorage.

For example:

slurm:
    storage:
        ExtraMounts:
        - dest: "/tools"
          StorageType: FsxOpenZfs
          VolumeId: 'fsvol-abcd1234'
          src: 'fs-efgh5678.fsx.us-east-1.amazonaws.com:/fsx/'
          type: nfs4
          options: 'nfsvers=4.1'

dest

The directory where the file system will be mounted.

This sets the MountDir.

src

The source path on the file system export that will be mounted.

type

The type of mount. For example, nfs3.

options

Mount options.

StorageType

The type of file system to mount.

Valid values:

  • Efs
  • FsxLustre
  • FsxOntap
  • FsxOpenZfs

FileSystemId

Specifies the ID of an existing FSx for Lustre or EFS file system.

VolumeId

Specifies the volume ID of an existing FSx for ONTAP or FSx for OpenZFS file system.

Licenses

Configure license counts for the scheduler.

If the Slurm database is configured then it will be updated with the license counts. Otherwise, the license counts will be added to slurm.conf.
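
For example, a sketch of a license definition; the license name, server, and counts are illustrative.

slurm:
  Licenses:
    VCSCompiler_Net:
      Count: 10
      Server: flexlm.example.com
      Port: '2100'
      ServerType: FlexLM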

LicenseName

The name of the license, for example, VCSCompiler_Net or VCSMXRunTime_Net. This is the license name that users specify when submitting a job. It doesn't have to match the license name reported by the license server, although that probably makes the most sense.

Count

The number of licenses available to Slurm to use to schedule jobs. Once all of the licenses are used by running jobs, any pending jobs will remain pending until a license becomes available.

Server

The license server hosting the licenses.

Not currently used.

Port

The port on the license server used to request licenses.

Not currently used.

ServerType

The type of license server, such as FlexLM.

Not currently used.

StatusScript

A script that queries the license server and dynamically updates the Slurm database with the actual total number of licenses and the number used.

Not currently implemented.