submitit

A less horrible way to parallelize on HPC clusters

March 9, 2018 — February 14, 2024

computers are awful
computers are awful together
concurrency hell
distributed
premature optimization

My current go-to option for python HPC jobs. If anyone has ever told you to run some code by creating a bash script full of esoteric cruft like #SBATCH -N 10, then you have experienced how user-hostile classic compute clusters are.

submitit makes all that nonsense go away and keeps the process of executing code on the cluster much more like classic parallel code execution in python, which is to say, annoying but tolerable. That is still better than the wading-through-punctuation-swamp vibe of submitting jobs via vanilla SLURM or TORQUE.

1 Examples

It looks like this in basic form:

import submitit

def add(a, b):
    return a + b

# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="log_test")
# set timeout in min, and partition for running the job
executor.update_parameters(
    timeout_min=1, slurm_partition="dev",
    tasks_per_node=4  # number of cores
)
job = executor.submit(add, 5, 7)  # will compute add(5, 7)
print(job.job_id)  # ID of your job

output = job.result()  # waits for completion and returns output
assert output == 12  # 5 + 7 = 12...  your addition was computed in the cluster
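While a job is in flight you can poke at it from the submitting session; job.state, job.stdout() and job.stderr() (which reappear in the longer example below) cover most of what I need. A quick sketch, noting that the exact state strings depend on the backend:

# scheduler-reported state, e.g. PENDING / RUNNING / COMPLETED under SLURM
print(job.state)
# captured output streams; these may be None until the job has written its logs
print(job.stdout())
print(job.stderr())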

The docs could be better. Usage is by example:

Here is a more advanced pattern I use when running a bunch of experiments via submitit:

import bz2
import cloudpickle
import submitit

job_name = "my_cool_job"

def expensive_calc(a):
    return a + 1

# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="log_test")
# set timeout in min, and partition for running the job
executor.update_parameters(
    timeout_min=1,
    slurm_partition="dev",  # misc slurm args
    tasks_per_node=4,  # number of cores
    mem_gb=8,  # memory in GB (see the gotcha about memory arguments below)
)

#submit 10 jobs:
jobs = [executor.submit(expensive_calc, num) for num in range(10)]

# but actually maybe we want to come back to those jobs later, so let’s save them to disk
with bz2.open(job_name + ".job.pkl.bz2", "wb") as f:
    cloudpickle.dump(jobs, f)

# We can quit this python session and do something else now
# Resume after quitting
with bz2.open(job_name + ".job.pkl.bz2", "rb") as f:
    jobs = cloudpickle.load(f)

# wait for all jobs to finish
for job in jobs:
    job.wait()

res_list = []
# # Alternatively, append these results to a previous run
# with bz2.open(job_name + ".result.pkl.bz2", "rb") as f:
#     res_list.extend(cloudpickle.load(f))

# Examine job outputs for failures etc
fail_ids = [job.job_id for job in jobs if job.state not in ('DONE', 'COMPLETED')]
res_list.extend([job.result() for job in jobs if job.job_id not in fail_ids])
failures = [job for job in jobs if job.job_id in fail_ids]

if failures:
    print("failures")
    print("===")
    for job in failures:
        print(job.state, job.stderr())

# Save results to disk
with bz2.open(job_name + ".result.pkl.bz2", "wb") as f:
    cloudpickle.dump(res_list, f)

Cool feature: the spawning script only needs to survive as long as it takes to put the jobs on the queue, and then it can die. Later on we can reload those jobs from disk as if it all happened in the same session.

2 Gotchas

2.1 DebugExecutor is not actually useful

When developing a function with submitit it might seem convenient to use DebugExecutor to work out why things are crashing. I cannot work out when it is actually useful. For one, DebugExecutor executes sequentially in-process, so any failure that arises from concurrency will not occur there. And if we do not want to debug concurrency problems, but just the function itself, we can simply execute the function in the normal way: executor = DebugExecutor(); executor.submit(func, *args, **kwargs) does not get us anything new that we do not already get from running func(*args, **kwargs) directly.
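To make that concrete, here is roughly the comparison. I am assuming that passing cluster="debug" to AutoExecutor is the sanctioned way to get a DebugExecutor, and plus_one is just a stand-in function:

import submitit

def plus_one(x):
    return x + 1

# runs in this very process, sequentially: no scheduler, no separate worker
executor = submitit.AutoExecutor(folder="log_test", cluster="debug")
job = executor.submit(plus_one, 41)
assert job.result() == 42

# ...which buys essentially nothing over the obvious
assert plus_one(41) == 42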

There is an exception: DebugExecutor does get us one small extra thing, and it is a thing we do not want. DebugExecutor always invokes pdb or ipdb, the interactive python debuggers, on error. As such it cannot be run from VS Code, because VS Code does not support those interactive debuggers.

2.2 Weird argument names

submitit seems to believe that the slurm_mem argument requisitions memory in GB, but SLURM interprets it as MB by default (see --mem). There is an alternative argument, mem_gb, which does allocate memory in GB.
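In practice I sidestep the ambiguity by spelling the units out with mem_gb (reusing the executor from the examples above):

# request 8 GB of RAM, unambiguously
executor.update_parameters(mem_gb=8)

# whereas this value is passed through to sbatch --mem, where a bare number
# means MB, so it would request a miserly 8 MB:
# executor.update_parameters(slurm_mem=8)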

3 hydra

Pro tip: Submitit integrates with hydra.
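As far as I can tell the integration goes via the hydra-submitit-launcher plugin, which adds submitit_local and submitit_slurm launchers; roughly (names as I remember them, check the Hydra docs):

pip install hydra-submitit-launcher
python my_app.py --multirun hydra/launcher=submitit_slurm hydra.launcher.timeout_min=60

Other launcher parameters (partition, memory and so on) are set the same way, as hydra.launcher.* overrides or in the app's config.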