SLURM Isolation


#1

Hi,

I have a question about a specific use case. I have a SLURM (https://slurm.schedmd.com/) cluster and I want to run builds as SLURM jobs. As far as I have seen, this should be possible by creating an isolation plugin.
SLURM is mainly used for allocating/scheduling resources (e.g. srun --mem=23 --cpus-per-task=2 /run/some/job).

Now an additional requirement would be to run these slurm jobs as a specific user (the one that started the build). Would this somehow be possible using SSH keys (that the user adds to a project)?

If that scenario seems reasonable/possible, I am glad to implement a plugin.

Best Regards
Matthias


#2

Hi,

I don't know anything about slurm yet, but assuming it needs to spin up a remote node to do the job, it is probably still possible - I'm just not sure it's a good fit. :)

What happens right now is the scm plugin runs on the worker to check out the code, and then the isolation plugin runs the actual build.

If you wanted to simply copy those files to a remote node, run the build on those new nodes, and copy the results back, yes, that would technically work.

However, it sounds more like you might just want to keep some workers running on your cluster.

Another idea that has come up is having plugins that autoscale themselves based on the number of builds in queue for a given worker pool. If we were to do this we would probably have a new type of plugin ("autoscaling") which you could configure for AWS or OpenStack or whatever to scale a worker pool up or down automatically based on the number of queued jobs.

This could also take some parameters as to the minimum and maximum size.

In that configuration we could (optionally) also teach the worker a new flag where the worker just runs one job and then immediately exits, or have some way for it to ask the overseeing process whether it should spin down or not.

This type of autoscaling plugin does not currently exist. If that's more of what you are looking for I'd be willing to write a stub for it, even a basic one that just executes shell commands might be a start.

Technically you could use the isolation plugin infrastructure now but you'd still need a worker to do the checkouts and ship them off, and those workers would be idle until the job returned.

That might be ok, say if you have crazy long job times - you'd just have one vespene worker, using the isolation plugin you wrote, waiting on that slurm job until it came back.

As I'm not sure which of those cases you want, let me know and we can see what we can do.

As slurm isn't all that common, it probably wouldn't be something I'd want to maintain in tree, but it's also something I think the plugin ecosystem could easily accommodate.


#3

A generic autoscaling plugin that can pass in some variables for the size of a thing based on the size of a worker pool queue, and send that to terraform might be the easiest way to accomplish this.

If slurm isn't terraform friendly it could still use the same infrastructure.

We could send it parameters like the min and max size and the number of workers that should be allocated for each job in a given named queue.

The plugin could be run by a manage.py command as an additional supervisord process.

Thoughts?


#4

Example plugin config might look like this:

autoscaling = [ "vespene.plugins.autoscaling.executor", dict(
    worker_pools = dict(
        main_build_queue = dict(
            cmd = "terraform --blarg foo.tf --size={{ suggested_size }}",
            min_size = 1,
            max_size = 10,
            jobs_per_worker = 2,
            re_evaluate_after_seconds = 120
        )
    )
) ]

Then, when you want to run the autoscaler, you just pick any node and run this (once) via supervisor:

python manage.py run_autoscaler 

This would look at all configured queues and keep scaling them every so many seconds.

So if there were 4 jobs in the queue, the command could run and try to change the worker pool to size 2 (4/2 = 2).
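Roughly, the arithmetic I have in mind is something like this - just a sketch, where the names mirror the example config above and the function itself is hypothetical:

import math

def suggested_size(queued_jobs, jobs_per_worker, min_size, max_size):
    # one worker per jobs_per_worker queued builds, clamped to the pool limits
    wanted = math.ceil(queued_jobs / jobs_per_worker)
    return max(min_size, min(wanted, max_size))

print(suggested_size(4, jobs_per_worker=2, min_size=1, max_size=10))  # -> 2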

We could also add a flag to the worker that would cause it to die (optionally) after running one build, versus running in a daemon mode.

Setting up the images for the workers themselves would be up to the reader, but I think this could be cool in that it would be 100% generic.

Hopefully people could contribute examples for AWS and other cloud providers.

(Note: the isolation plugin with an rsync will technically work instead of this new proposal; you'll just need some workers running all the time to feed jobs to that cluster. And if these are automation jobs, that might not matter.)

What's kind of cool about the proposed proposal (ahem) is that by allowing any command or script it would be super easy to customize - it's really just a command that sends commands when the queue size changes.


#5

Thanks already for that thorough answer.

First a bit of clarification regarding slurm:
In a slurm cluster, all nodes normally share common storage (e.g. NFS or whatever distributed filesystem). So when a worker is running on any of the nodes, there is no need to copy files around.

When you start a job, slurm checks if resources are available and starts the job on a node that has them. If not, the job is queued until resources become available again.

There are basically two modes in which you can run jobs. In interactive mode it's like starting a new shell using the allocated resources. In batch mode a script is submitted and you don't interact with it directly.

So about your autoscaling plugin:
Sounds pretty cool. Just to make sure I understand: you would run a new process which runs the autoscaler.
It checks for jobs in the queue and executes cmd as soon as a job is available and max_size is not already exceeded. The cmd then has to spin up a worker, which does the work as usual.

So for slurm you normally just submit jobs and slurm executes them as soon as resources are available. With your proposed autoscaling plugin, cmd could submit a slurm job, which then runs the worker. That would be awesome.
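For example (just adapting your config sketch from above - the queue name, sbatch flags and wrapper script are placeholders, not anything real yet), cmd could be something like:

autoscaling = [ "vespene.plugins.autoscaling.executor", dict(
    worker_pools = dict(
        gpu_build_queue = dict(
            cmd = "sbatch --gres=gpu:1 --mem=8G start_vespene_worker.sh",
            min_size = 0,
            max_size = 4,
            jobs_per_worker = 1,
            re_evaluate_after_seconds = 60
        )
    )
) ]

where start_vespene_worker.sh would live on the shared filesystem and simply start a vespene worker for that pool.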

The login thing:
Slurm allocates resources on a user level, so every user has some quota (e.g. max 4 jobs, 10 cpus, whatever). So a job needs to be started as the user that kicks off the build. Would it be possible to sudo to some user, using an SSH key (defined in the project), before executing cmd?


#6

In my proposal I think you would still set the worker pool up for sudo isolation as normal - it is just that the shell script the autoscaler calls is what allocates the worker. So the isolation plugin does not even change?

Honestly, I can probably get the skeleton of this up (for calling the script) early next week.

In the above example you could just not set the max limit.

I'm a bit interested in your use case, as it looks like this is used on a bunch of supercomputers - so if you want to share or message me details, I'm interested!


#7

Sure, so I am mainly doing machine learning stuff. For running experiments we have a slurm-based cluster, where some nodes have multiple GPUs per node. Slurm is needed to isolate GPUs (and also cpu/mem) between different users. It also takes care of preventing a single user from occupying all the GPUs.

For now the workflow is to log in to a login node of the cluster. This node does no heavy work and is only there for the users to interact with the cluster. Then you pull your code/data onto the shared filesystem.
Then you run a slurm job, which starts the experiment. It is pretty common to want to run multiple experiments in parallel, which is kind of painful since you may end up with multiple copies of the same codebase with only small changes.

So my goal is to adapt the CI workflow from software development somehow. Then you could define your experiments in different git branches for example. And you wouldn't need to care about execution. It also improves reproducibility, which is quite important for such experiments.


#8

Regarding the sudo isolation, this comes into play when the worker is started, right?
But to execute the slurm job as a given user, the sudo would need to happen before the worker is started, i.e. inside the cmd command. So basically the worker needs to be set up based on information from the queued job.


#9

The current flow is: the worker checks out the repo, then the sudo isolation happens before cd'ing into the build root to run the script, which is saved as vespene_launch.sh.

If you need to sudo before allocating the slurm job, then maybe I can look at passing the sudo name from the worker pool as a variable to the autoscaler?

If you need it per project, I can also hand you every variable on the project too, but if the user can pick that value it is not so secure, since they can also edit the projects.

The SSH keys are all added (via ssh-add) prior to establishing the build root - so the build WILL have access to them, but you won't know much about what user they are for.

Technically I could give you the names of the SSH key objects too - that might possibly work.

What seems most logical there?


#10

Also technically, the "build.created_by" variable says what user (if not a webhook) launched a build.

This is less useful for a few reasons:

(A) if the build came from a webhook, we don't have any user there
(B) this implies the usernames/passwords in Vespene are LDAP-synchronized (you could do this with Django auth) - but if they are not set up that way, it might allow someone to obtain privs you don't want them to have just by creative account naming.

What I am planning here is handing the plugin the information, but not necessarily sudo'ing for you.

For instance, like in the terraform example mentioned above, you COULD use {{ suggested_size }} or whatever as a variable, and if you wrote your own python plugin you would get a copy of the actual build object and could introspect all sorts of things from it, like build.project.ssh_keys or build.project.created_by or build.worker_pool.sudo_user.
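To make that concrete, a hand-rolled plugin might end up looking vaguely like this - a sketch only, since the plugin interface isn't finalized; the class/method names, the wrapper script, and build.id are made up, while the other attributes are the ones listed above:

import subprocess

class SlurmExecutor(object):

    def start_worker_for(self, build):
        # introspect the build object handed to the plugin (sketch, not a real API)
        sudo_user = build.worker_pool.sudo_user
        # sudo to that cluster user, then submit a hypothetical wrapper script
        # that launches a vespene worker for this one build
        subprocess.check_call([
            "sudo", "-u", sudo_user,
            "sbatch", "start_vespene_worker.sh", str(build.id)
        ])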


#11

So I could imagine the following: a user would have to add an SSH key with a given name (which allows sudo'ing to a user on the cluster). Second, the user has to define, via a variable, the username to use for sudo'ing.


#12

That seems like it would work for what you want to do.

The Vespene authz is already set up so that a user can decide who can use the SSH keys (when editing an SSH key, you can control the ownership groups).

I'll implement a basic shell script executor that passes in a bunch of variable contents (like the example above - the params might be a little different when I get done), add a documentation page, and the python code behind the shell executor should make it very easy to add your own for your particular slurm setup.

I'll try to document how to write your own executor a bit more on that docs page as well.


#13

Opened a ticket for me to get this started - https://github.com/vespene-io/vespene/issues/80


#14

That's great, thanks!


#15

Hey, I ended up implementing this a bit differently, but started a new thread because it had broader applicability than slurm in general.

The isolation thing isn't covered so much, and there's no way to introspect the user's SSH keys because this is about allocating workers.

So I'm not sure this totally helps your use case.

You can still play a lot of interesting games inside an isolation plugin (including copying the build root to another share, kicking off some stuff, running it, waiting on it, whatever) - you'll just have to do it yourself, more or less.
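For instance, the guts of that kind of plugin could boil down to something like this - pure sketch, not the real plugin API, and the function name, flags and paths are placeholders:

import subprocess

def run_build_on_slurm(build_root, shared_dir):
    # copy the build root onto the cluster's shared filesystem
    subprocess.check_call(["rsync", "-a", build_root + "/", shared_dir + "/"])
    # run the launch script as a slurm job; srun blocks until the job
    # finishes, so the vespene worker just sits here waiting on it
    subprocess.check_call(
        ["srun", "--cpus-per-task=2", "--chdir=" + shared_dir, "bash", "vespene_launch.sh"]
    )
    # copy any results back so the worker can finish up as usual
    subprocess.check_call(["rsync", "-a", shared_dir + "/", build_root + "/"])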


#16

Hey, thanks anyway. I started to implement my own "ondemand" scheduler, which starts a worker when a build is queued. It basically works. The only problem I have is ensuring that the worker only processes the one job it was started for.

I would add an extra argument to the worker command. Can I open a pull request to add that to the main codebase, so I don't have to maintain a separate version? Everything else doesn't need any changes to the main codebase (so far at least).


#17

Sure!

Want to do something like --build-id= where it doesn't poll, just tries to pop off the build, and fails if it is not there?

Seems like it should be a very short amount of code and it might prove useful for debugging/testing or something else later.
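Conceptually something like this - a sketch only, since the real worker code will look different and the queue object here is just a stand-in:

import sys

def run_single_build(build_id, queued_builds):
    # hypothetical --build-id mode: try to claim exactly that build,
    # fail immediately if it is not queued, and never fall back to polling
    build = queued_builds.pop(build_id, None)
    if build is None:
        sys.exit("build %s is not in the queue" % build_id)
    build.run()  # run just this one build, then return and let the worker exit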


#18

Yes, exactly. Then I will submit a pull request soon.