Manpage of RESS
RESS
Section: Maintenance Commands (8)
Updated: SkyForm AIP Version 10.25.0 - April 2025
NAME
ress - Resource Sensor (RESS) for the SkyForm AIP system
SYNOPSIS
CB_SERVERDIR/ress.[master|host|hostname]
DESCRIPTION
RESS is a custom resource plugin for the AIP Load Server (cbls).
It is a site-written program that CBLS calls to obtain site-specific
resource data, which the AIP system then makes available for jobs to
consume.
RESS must reside in the CB_SERVERDIR directory and must be executable.
RESS must have one of the following file names:
- ress.master, which is executed by CBLS on the master host.
- ress.host, which is executed by CBLS on every AIP host.
- ress.hostname, which is executed by CBLS on the host whose host name is
hostname.
CBLS periodically looks for executables named ress.master, ress.host, and ress.hostname
in the CB_SERVERDIR directory and automatically executes any that exist.
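The naming and permission rules above can be sketched as a small install helper. This is an illustration only: it assumes CB_SERVERDIR is set as an environment variable in the administrator's shell (this page names the directory but does not say how it is exported), and install_ress is a hypothetical helper name, not part of AIP.

```shell
#!/bin/bash
# install_ress SRC KIND
# Copy SRC to $CB_SERVERDIR/ress.KIND and make it executable.
# KIND is "master", "host", or a specific host name.
# Assumption: CB_SERVERDIR is exported in the current environment.
install_ress() {
    local src=$1 kind=$2
    local dest="${CB_SERVERDIR:?CB_SERVERDIR is not set}/ress.${kind}"
    cp "$src" "$dest" && chmod 755 "$dest" && echo "$dest"
}
```

For example, install_ress ./my_sensor.sh master would place a (hypothetical) script my_sensor.sh at CB_SERVERDIR/ress.master and print the destination path.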
RESS CODE LOGIC
The RESS code should be a loop that periodically updates resource data.
In each iteration, it writes the resource data in YAML format to its stdout,
then sleeps for a number of seconds before updating the resource again.
For a resource that is not frequently updated, the sleep time between iterations can be long, for example
a number of hours.
The shortest data update interval is 5 seconds.
The YAML format of the resource data is explained in the
next section.
Example RESS code, ress.master:
#!/bin/bash
# Report the seconds field of the clock every 10 seconds.
while true; do
    echo "- resource: clksnd"
    echo "  description: The second value of the current clock"
    echo "  type: number"
    echo "  direction: increase"
    echo "  value: $(date +%S)"
    # locale is the last field of the resource
    echo "  locale: master node01"
    # End of this YAML document
    echo "---"
    sleep 10
done
OUTPUT FORMAT
In each loop iteration, the RESS output on stdout must be a complete YAML document. The document must start with a sequence and end with a separate line
containing only "---". This line tells CBLS to stop reading and wait for the next iteration.
Each item of the sequence describes one resource. A document may describe multiple resources.
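As a sketch of what one iteration might write, the function below prints a single document with two sequence items followed by the "---" terminator. The resources are the ones used in the examples on this page; emit_document is a hypothetical function name, and the two-space indentation of the attribute lines is the usual YAML layout, assumed here to be what CBLS expects.

```shell
#!/bin/bash
# Print one complete YAML document that describes two resources.
emit_document() {
    cat <<EOF
- resource: clksnd
  description: The second value of the current clock
  type: number
  direction: increase
  value: $(date +%S)
- resource: network
  description: my network
  type: text
  value: IB
---
EOF
}

# In a real RESS, the function would be called from the endless loop:
#   while true; do emit_document; sleep 10; done
```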
Attributes of each resource are:
- resource:
-
Defines the name of the resource. The resource name should not be longer than 32 characters. It must start with a letter and cannot contain any
of the following characters: .!-=+*/[]@:&|{}'`\"
- description:
-
The description of the resource. It can contain any character. The total length must be 255 characters or shorter.
- type:
-
The resource type. The valid values are: number, which means the resource value is numeric, either integer or floating point; text,
which means the resource value is free text shorter than 32 characters; tag, which means the resource has no value and the resource
name serves as a tag for the host. Jobs can then select hosts by using the tag, i.e. the resource name.
- direction:
-
The direction in which the resource value moves from best to worst. It takes the value "increase" or "decrease". The value
"increase" means the smaller the number is, the more resource the system has. For example, CPU utilization is an "increase" resource. The
value "decrease" means the larger the number is, the more resource the system has. For example, free memory is a "decrease" resource.
This parameter applies only to resources of type "number".
- release:
-
It takes the value "yes" or "no". When release is "yes", the resource is released when the job
that requests the resource is suspended. The default value is "no", i.e. the
resource is not released upon job suspension.
- assign:
-
For "number" resources only.
It takes the value "yes" or "no". When assign is "yes", the resource units are enumerated, and the
scheduler assigns specific resource units to the job that requests the resource. For example, suppose the resource
gpu has "assign" set to "yes". If a job requests 2 gpu resources on a host, the scheduler assigns
specific GPU numbers, for example GPU 2 and GPU 3, to the job. The job then knows which GPUs to use and avoids
conflicts with other jobs that request GPUs on the same host. The default value
is "no", i.e. resource units are not assigned for each task (job slot).
- slotresource:
-
For "number" resources only. If the value is set to "yes", the total amount
of resource reserved for the job is the value requested in the job resource
requirement times the number of slots scheduled for the job. For example, a
job that requests 2 units and is scheduled on 4 slots reserves 8 units. The default
value is "no", i.e. the total amount of resource reserved for the job
is the same as what is specified in the resource requirement.
- value:
-
The current value of the resource. The value can be a number for a "number" resource, or a text string for a "text" resource. The value has no
meaning for a "tag" resource.
- locale:
-
Indicates the location of the reported resource. If this parameter is missing, the system treats the resource value as belonging to the host that
runs the RESS. The location is a list of hosts that share the resource. For a host-based resource, this parameter can be either
the name of the host for which the resource value is reported, or absent. For a cluster-wide resource, use the reserved word "all". If a resource is shared
by a number of hosts, list the names of these hosts separated by spaces.
WARNING: locale must be the last field in the YAML output.
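The naming rules for the resource attribute can be checked before a sensor is deployed. The function below is a minimal sketch of such a check, not something CBLS provides: valid_resource_name is a hypothetical helper, and it treats "letter" as any character matched by the shell class [[:alpha:]], which is an assumption.

```shell
#!/bin/bash
# valid_resource_name NAME
# Return 0 if NAME follows the rules stated for "resource:" above:
# at most 32 characters, starts with a letter, and contains none of
# the characters .!-=+*/[]@:&|{}'`\"
valid_resource_name() {
    local name=$1
    local forbidden='.!-=+*/[]@:&|{}'"'"'`\"'
    local i c

    [[ ${#name} -le 32 ]] || return 1
    [[ $name == [[:alpha:]]* ]] || return 1

    # Check each forbidden character literally, one at a time.
    for ((i = 0; i < ${#forbidden}; i++)); do
        c=${forbidden:i:1}
        if [[ $name == *"$c"* ]]; then
            return 1
        fi
    done
    return 0
}
```

With this sketch, valid_resource_name clksnd succeeds, while names such as gpu-1 (contains "-") or 1gpu (does not start with a letter) are rejected.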
EXAMPLE
The following example reports two resources: a "number" resource and a "text" resource. Neither specifies "locale", so both resources are reported for the host
that runs ress.hostname.
#!/bin/bash
while true; do
    # A "number" resource, released when the requesting job is suspended
    echo "- resource: gpu1"
    echo "  description: GPU utilization"
    echo "  type: number"
    echo "  direction: increase"
    echo "  value: 0.8"
    echo "  release: yes"
    # A "text" resource
    echo "- resource: network"
    echo "  description: my network"
    echo "  type: text"
    echo "  value: IB"
    # End of this YAML document
    echo "---"
    sleep 10
done
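The example above reports fixed values. A sensor that measures something and reports it cluster-wide might look like the sketch below. It is illustrative only: the resource name sharedfree, the function name emit_shared_free, and the idea of measuring free disk space with df are assumptions, not part of AIP. Free space is a "decrease" resource because a larger number means more resource, and locale is written as the last field.

```shell
#!/bin/bash
# emit_shared_free DIR
# Report the free space of the file system holding DIR, in MB, as a
# cluster-wide "number" resource.
emit_shared_free() {
    local dir=$1 free_mb
    # df -Pk prints a header line plus one POSIX-format line for DIR;
    # field 4 is the available space in 1024-byte blocks.
    free_mb=$(df -Pk "$dir" | awk 'NR == 2 { print int($4 / 1024) }')
    cat <<EOF
- resource: sharedfree
  description: Free space on the shared file system in MB
  type: number
  direction: decrease
  value: ${free_mb}
  locale: all
---
EOF
}

# In a real ress.master, the function would be called from the loop,
# with a path chosen by the site (hypothetical here):
#   while true; do emit_shared_free /shared; sleep 60; done
```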