Submitting Jobs using GAME modules
When jobs are submitted to an HPC cluster, we cannot control which job starts first. This complicates server-client connections, because the server must be running before the client can connect. The following documentation outlines how to launch jobs using Evaluator and Predictor containers.
Predictor server saves its HOST/PORT and Evaluator client reads it in (recommended)
The Predictor job script creates a .txt file that contains the HOST and PORT that the Predictor will run on.
The Evaluator runs a while loop that checks and waits until this .txt file exists, which signals that the Predictor has started running and also communicates the HOST and PORT it should connect to.
The Predictor reads in the HOST and PORT and passes those into the apptainer run command.
We recommend this approach as it should work across all HPC systems and schedulers. Sample scripts can be found here: [LINK]
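The file-based handshake described above can be sketched in Python as follows. This is a minimal illustration of the logic, not the contents of the sample scripts, which may well be written as shell job scripts. The function names, the single-line "HOST PORT" file layout, and the polling and timeout values are all assumptions made for this example.

```python
import os
import tempfile
import time
from pathlib import Path


def write_endpoint(path, host, port):
    """Predictor side: publish HOST and PORT for the Evaluator to read."""
    path = Path(path)
    # Write to a temporary file in the same directory and then rename it,
    # so the Evaluator never observes a half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        # Assumed layout: one line, "HOST PORT".
        handle.write(f"{host} {port}\n")
    os.replace(tmp_name, path)


def wait_for_endpoint(path, poll_seconds=5.0, timeout_seconds=3600.0):
    """Evaluator side: wait until the file exists, then return (host, port)."""
    path = Path(path)
    deadline = time.monotonic() + timeout_seconds
    while not path.exists():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{path} did not appear within {timeout_seconds} s")
        time.sleep(poll_seconds)
    host, port = path.read_text().split()
    return host, int(port)
```

The file must live on a filesystem that both jobs can see, such as a shared project directory. Adding a timeout to the wait loop is a design choice of this sketch: it lets the Evaluator job fail cleanly if the Predictor job never starts, instead of waiting until the scheduler kills it.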
Notes:
On some HPC platforms, GPU nodes are isolated from CPU nodes. In this case, the Evaluator must also run on a GPU node to be able to connect to a Predictor.
Sometimes a node has multiple IP addresses, and not all of them have public access. This is system-specific, and users should double-check it on their own system. On our HPC system, the second IP address in the list returned by hostname -I is always public, so we extract it for the server-client connection.
There can be a slight delay between the time the Predictor creates the .txt file and the time its endpoint is exposed for connections. During this delay, the Evaluator may try to connect to a server that is not listening yet. To mitigate this, the client includes a retry loop that keeps trying to connect until the server is up and running.
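The last two notes can be illustrated with a short Python sketch. It assumes that the output of hostname -I has already been captured as a string and that the Predictor exposes a plain TCP endpoint. The function names, the default index, and the retry counts are hypothetical and chosen only for this example; which address is reachable must be verified on each system.

```python
import socket
import time


def pick_address(hostname_output, index=1):
    """Return one address from the space-separated output of `hostname -I`."""
    addresses = hostname_output.split()
    if index >= len(addresses):
        raise ValueError(f"expected at least {index + 1} addresses, got {addresses}")
    # index=1 selects the second address, matching the note above; this is
    # system-specific and will differ elsewhere.
    return addresses[index]


def connect_with_retry(host, port, attempts=30, delay_seconds=2.0):
    """Open a TCP connection, retrying while the server is still starting."""
    last_error = None
    for _ in range(attempts):
        try:
            return socket.create_connection((host, port), timeout=10)
        except OSError as error:
            # Covers refused connections and timeouts; wait and try again.
            last_error = error
            time.sleep(delay_seconds)
    raise ConnectionError(f"could not reach {host}:{port}") from last_error
```

Bounding the number of attempts keeps a misconfigured address from stalling the Evaluator job indefinitely. The final error chains the last underlying exception, so the job log shows why the connection failed.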