Autoscaling Features: Just Added


#1

Autoscaling plugins are now configurable on the Worker Pool objects.

This is very VERY new and needs some help testing.

The autoscaler assumes you have automation of your own that specifies the worker image, or otherwise pushes and maintains the worker code. In other words, it is a facility for deciding how many nodes you need, then calling something like CloudFormation or Terraform to make sure you have the right number of nodes.
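
To make the "decide how many nodes, then call a tool" idea concrete, here is a rough Python sketch. It is not Vespene's actual plugin code: the function names, the sizing rule, and the `worker_count` Terraform variable are all assumptions made up for illustration.

```python
import math
import shlex
import subprocess


def desired_worker_count(queued_builds, running_builds, builds_per_worker=1,
                         minimum=0, maximum=10):
    """Hypothetical sizing rule: one worker per `builds_per_worker` builds,
    clamped to the range [minimum, maximum]."""
    needed = math.ceil((queued_builds + running_builds) / builds_per_worker)
    return max(minimum, min(maximum, needed))


def build_scaling_command(template, count):
    """Fill the desired count into a command template and split it into argv."""
    return shlex.split(template.format(count=count))


def apply_scaling(template, count, dry_run=True):
    """Run the provisioning tool, or just return the argv when dry_run is set."""
    argv = build_scaling_command(template, count)
    if not dry_run:
        # check=True raises CalledProcessError if the tool exits non-zero
        subprocess.run(argv, check=True)
    return argv


# Assumed template; `worker_count` is a made-up Terraform variable name.
TEMPLATE = "terraform apply -auto-approve -var worker_count={count}"
count = desired_worker_count(queued_builds=5, running_builds=2, builds_per_worker=2)
print(apply_scaling(TEMPLATE, count))
```

The dry-run default is there so you can see what would be executed before letting it touch real infrastructure.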

It doesn't speak the AWS API, Kubernetes API, OpenStack API, or anything like that directly. That being said, I welcome people building plugins that DO these things, and pull requests would be welcome if the interfaces are stable.

The autoscaler requires a management process to issue autoscaling commands.

To actually run the autoscaler, pick a single node and run:

python manage.py autoscaler --queue

As part of this series of commits I have also added some worker options that help the worker nodes terminate early if they have no work to do, or if they have been running for a while. They are intended to be used only with ephemeral build nodes.

These are:

--max-builds=N
--max-wait-minutes=N

And both are described in the documentation.
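
For anyone wondering how limits like these behave in a worker loop, here is a rough Python sketch. It is not the actual Vespene worker: `run_worker`, `fetch_build`, and `run_build` are hypothetical names, and I am assuming that --max-builds means "exit after N builds" and --max-wait-minutes means "exit after N idle minutes". The documentation is the authority on the real semantics.

```python
import time


def run_worker(fetch_build, run_build, max_builds=None, max_wait_minutes=None,
               poll_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
    """Process builds until one of the (assumed) limits is reached.

    Returns a string naming the limit that caused the exit.
    """
    builds_done = 0
    idle_since = clock()
    while True:
        # Assumed meaning of --max-builds: stop after N completed builds.
        if max_builds is not None and builds_done >= max_builds:
            return "max-builds"
        build = fetch_build()
        if build is not None:
            run_build(build)
            builds_done += 1
            idle_since = clock()  # reset the idle timer after real work
            continue
        # Assumed meaning of --max-wait-minutes: stop after N idle minutes.
        if (max_wait_minutes is not None
                and clock() - idle_since >= max_wait_minutes * 60):
            return "max-wait-minutes"
        sleep(poll_seconds)


queue = ["build-1", "build-2", "build-3"]
reason = run_worker(lambda: queue.pop(0) if queue else None, print, max_builds=2)
print("worker exited because of:", reason)
```

The clock and sleep functions are passed in only so the loop can be tested without actually waiting.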

This is just a start and can likely be improved in lots of ways.

Documentation here: http://docs.vespene.io/autoscaling.html

What's pretty nice is that this took just a DAY to add, so it hopefully showcases some of the quick things you can do with the Vespene plugin model and database model if you scan through the commits.

Questions/comments/thoughts/feedback/docs-feedback ideas? Let me know what you think.


#2

Oh, and you'll probably get used to this:

(A) update your source code in /opt/vespene to the latest
(B) "make migrate" to apply DB migrations
(C) bounce any running services

No need to run any other steps.


#3

Here's the commit if people are interested in exploring the source or hacking on it:


#4

(Removed a few useless lines post commit, but you get the idea)


#5

I updated my original post after a really good suggestion.

Shell.py is basic and runs Terraform and CloudFormation:

Alternatives that speak directly to (for example) an ELB and adjust the size of an existing ELB are probably even LESS code.

So we would definitely entertain those kinds of plugins in tree - whether that is AWS, GCE, Azure, Kubernetes, OpenStack, etc.

Plugins should use the available Python client libraries.

Then, add the plugin to the plugins.py file to make it available.

You can reuse the "executor_command" field to hold the name of the load balancer.

I'll probably rename this field shortly to "executor_configuration".

There shouldn't be a need to provide security information. For instance, the IAM role of the autoscaling script should allow access to the Amazon API. In other cloud cases, if we need to provide security info, we can most likely use the usual dotfiles.
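
To give a feel for how small a direct-to-cloud plugin could be, here is a rough Python sketch. None of this is Vespene's real plugin interface: the class name and the `scale` method are hypothetical. I am also assuming the thing being resized is an AWS Auto Scaling group, that its name is what you put in "executor_command", and that the client is created with `boto3.client("autoscaling")`, which picks up credentials from the instance's IAM role on its own.

```python
class DirectCapacityPlugin:
    """Hypothetical plugin that sets capacity through a cloud client
    instead of shelling out to a config tool."""

    def __init__(self, client):
        # Assumption: `client` behaves like boto3.client("autoscaling").
        self.client = client

    def scale(self, executor_command, desired_count):
        """Treat the reused executor_command field as the resource name."""
        group_name = executor_command.strip()
        if not group_name:
            raise ValueError("executor_command must hold the name of the group to resize")
        self.client.set_desired_capacity(
            AutoScalingGroupName=group_name,
            DesiredCapacity=int(desired_count),
        )
        return group_name


class RecordingClient:
    """Stand-in client so the sketch can run without AWS access."""

    def __init__(self):
        self.calls = []

    def set_desired_capacity(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
DirectCapacityPlugin(client).scale(" vespene-workers ", 4)
print(client.calls)
```

Passing the client in keeps the plugin itself free of any credential handling, which fits the "no security information in the config" point above.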

Let me know if there are questions... this would allow for extremely fast autoscaling and not require people to learn a cloud config tool.