<rekado_>with PiGx we bumped into a limitation of Snakemake that I’d like to address in GWL
<rekado_>it’s again in the context of running workflows on rented machines (aka cloud)
<rekado_>with AWS it’s easier to rent a really big node and run everything locally than to build up an HPC cluster
<rekado_>this means, however, that the most expensive parts of the workflow determine the size of the node.
<rekado_>this is very wasteful for long running computations that consist of expensive and cheap operations
<rekado_>so I was thinking of building a simple queueing system: have at least two nodes, one big one small, and queue up jobs manually so that we fill the queues sensibly (I’m not even thinking about “optimally”)
<rekado_>start the big node only right before the first job is executed and terminate it right after the last job finishes.
<rekado_>this requires offline computation of the process graph, and it requires knowledge about the complexity of each step
<rekado_>finally, it requires the ability to impose an execution schedule on the workflow engine
<rekado_>Snakemake can compute the process graph, but it does not have a concept of work complexity (which may be a function of the size of the input data) nor does it allow for an execution schedule to be imposed
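A minimal sketch in Python of the offline idea sketched above: a topological sort over a declared process graph, with a per-process work complexity annotation that a scheduler could use to decide when the big node must be alive. All process names and costs here are hypothetical, not PiGx's actual steps.

```python
from graphlib import TopologicalSorter

# Hypothetical process graph: process name -> set of prerequisites.
graph = {
    "trim":   set(),
    "align":  {"trim"},
    "count":  {"align"},
    "report": {"count"},
}

# Declared work complexity per process; in a real system this could
# be a function of the size of the input data.
cost = {"trim": 1, "align": 100, "count": 5, "report": 1}

# Offline schedule: one valid topological order over the graph.
order = list(TopologicalSorter(graph).static_order())

# The big node only needs to run while the expensive steps execute,
# so it can be started just before the first of these and terminated
# right after the last.
big_node_steps = [p for p in order if cost[p] >= 50]
```

With the graph above, `order` is the chain trim → align → count → report, and only "align" needs the big node.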
<rekado_>Now I wonder if offline scheduling is at all sensible, or if it would be better if the GWL gained the ability to schedule jobs at runtime based on actual completion feedback.
<rekado_>this gets dangerously close to implementing a really poor version of SGE or slurm
<rekado_>ignoring the cloud for a moment, though, I think it would be a useful feature.
<rekado_>say you have three idle machines in your network. You have SSH access to them and they have a shared filesystem to exchange data.
<rekado_>you don’t have a proper HPC setup, no cluster scheduler.
<rekado_>it sounds like a good idea to me to tell the GWL about these machines, have the GWL query the most basic resources (number of cores, amount of RAM), and then line up jobs for these three queues.
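A sketch of what "query the most basic resources and line up jobs" could mean in practice: a greedy first-fit assignment that queues each job on the smallest machine able to run it. Machine and job specs are invented for illustration; a real GWL backend would discover cores and RAM over SSH.

```python
# Hypothetical machines with their queryable resources.
machines = [
    {"name": "small-1", "cores": 4,  "ram_gb": 16,  "queue": []},
    {"name": "small-2", "cores": 4,  "ram_gb": 16,  "queue": []},
    {"name": "big-1",   "cores": 64, "ram_gb": 512, "queue": []},
]

jobs = [
    {"name": "align",  "cores": 32, "ram_gb": 200},
    {"name": "trim",   "cores": 2,  "ram_gb": 4},
    {"name": "report", "cores": 1,  "ram_gb": 2},
]

def assign(job, machines):
    """Queue the job on the smallest machine that can run it."""
    candidates = [m for m in machines
                  if m["cores"] >= job["cores"] and m["ram_gb"] >= job["ram_gb"]]
    if not candidates:
        raise ValueError(f"no machine can run {job['name']}")
    best = min(candidates, key=lambda m: (m["cores"], m["ram_gb"]))
    best["queue"].append(job["name"])
    return best["name"]

placements = {job["name"]: assign(job, machines) for job in jobs}
```

This is deliberately not "optimal" in the sense dismissed above; it just fills the queues sensibly.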
<zimoun>rekado_, civodul: scheduling is an art, but topological graph sort based on complexity seems simple enough to be doable and should be enough for GWL applications.
<rekado_>zimoun: I’d sidestep the problem of data locality completely.
<zimoun>rekado_: I do not like genetic algorithms, as in librcps; it sounds to me like: we try almost randomly to find the best.
<zimoun>yeah, I understand; it’s just that it is a parameter in the AWS context, I guess.
<rekado_>there are two approaches I’m okay with: a) ignore the problem and assume shared file systems, or b) give workflow authors the ability to “group” processes to specify a preference to execute them on the same location
<rekado_>re librcps: yes, whenever someone uses genetic algorithms exclusively it always sounds to me as if they only just discovered this cool new trick :)
<rekado_>I just stumbled upon it while searching for libraries
<rekado_>this space is surprisingly underdeveloped
<zimoun>about data transfer, I agree; that was just to be clear about the bounds of the problem. :-)
<zimoun>about librcps, I have not read the doc, though. Just the front page :-)
<rekado_>what you will find *a lot* is full-blown cluster schedulers, Cron-like stuff, general-purpose monoliths like Apache Airflow, and some “Kubernetes is cool” stuff.
<rekado_>the documentation is only 13 pages; it’s like the front page, but uses more words and includes an API reference.
<zimoun>well, I do not know if the code is really available via SVN; I am giving a look at the tarball
<zimoun>rekado_: a small C codebase. I do not know if it is worth using it instead of implementing something ourselves. I will think a bit about the problem: the least effort to have scheduling in GWL while staying maintainable.
<rekado_>what we do so far is very simple: we compute process dependencies and if the selected engine is Grid Engine we generate job scripts that declare other jobs as dependencies (by name)
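For context, Grid Engine's `qsub` lets a submission name the job (`-N`) and hold it until named jobs complete (`-hold_jid`), which is the mechanism described above. A sketch of generating such a submission command, with hypothetical job and script names:

```python
def qsub_command(name, script, dependencies=()):
    """Build a Grid Engine submission command that declares
    other jobs as dependencies by name via -hold_jid."""
    cmd = ["qsub", "-N", name]
    if dependencies:
        cmd += ["-hold_jid", ",".join(dependencies)]
    cmd.append(script)
    return " ".join(cmd)

first = qsub_command("trim", "trim.sh")
second = qsub_command("align", "align.sh", dependencies=["trim"])
```

Here `second` is `qsub -N align -hold_jid trim align.sh`: Grid Engine itself resolves the ordering, which is why this scheme only works when an SGE setup already exists.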
<rekado_>I want GWL to have a runner that keeps track of state (jobs yet to execute, jobs currently running, jobs that have failed, etc), and that can trigger the execution of job scripts — locally or remotely
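The runner described above amounts to a small state machine over the process graph. A hypothetical Python sketch (GWL itself is Guile Scheme; this only illustrates the state tracking, and the actual execution step, local or over SSH, is left as a comment):

```python
from enum import Enum

class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"

class Runner:
    """Tracks job state and releases jobs whose dependencies are done."""
    def __init__(self, graph):
        self.graph = graph                       # job -> set of dependencies
        self.state = {j: State.PENDING for j in graph}

    def ready(self):
        """Pending jobs whose dependencies have all completed."""
        return [j for j, deps in self.graph.items()
                if self.state[j] is State.PENDING
                and all(self.state[d] is State.DONE for d in deps)]

    def start(self, job):
        # Here a real runner would execute the job script,
        # locally or on a remote machine over SSH.
        self.state[job] = State.RUNNING

    def finish(self, job, ok=True):
        self.state[job] = State.DONE if ok else State.FAILED
```

Scheduling at runtime then falls out naturally: completion feedback (`finish`) unlocks the next `ready` set, without needing the whole schedule computed offline.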
<zimoun>I think I get your points and from my understanding I agree. What is not clear to me is the correct level. Currently, GWL is doing more or less the same as Snakemake, maybe a bit less :-), with the limitations you described. It is not clear how to have the features you are describing without re-implementing Cuirass or yet another grid engine.
<rekado_>I think it’s fine to implement a teeny subset of grid engine
<rekado_>currently the GWL is rather useless if you don’t already have an SGE setup
<rekado_>you can run workflows *locally* as an alternative, but that’s really limiting.
<zimoun>I agree, and one key of your argument is AWS, at least IMHO. :-)
<rekado_>to me it comes back to the “SSH backend” I wanted for the GWL
<zimoun>rekado_: I am going to give a look after our Outreachy weekly meeting. :-)
<zimoun>the point is to be able to have coexistence of R packages, from R land and from Guix, right?
<rekado_>zimoun: it’s coexistence of R from Guix and R from Guix, really
<rekado_>it lets you install R packages via Guix from within a running R session
<rekado_>you don’t have to use the Guix names for the packages as they are renamed automatically
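The automatic renaming follows Guix's convention for CRAN packages; a sketch of that convention (an illustration of the mapping, not the tool's actual code):

```python
def guix_name(r_package):
    """Map a CRAN-style package name onto Guix's naming convention:
    an r- prefix, lowercased, with dots turned into dashes.
    Sketch of the convention, assumed from Guix's r- packages."""
    return "r-" + r_package.lower().replace(".", "-")
```

So `ggplot2` becomes `r-ggplot2` and `data.table` becomes `r-data-table`, and the user never has to know the Guix names.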
<rekado_>if a package isn’t available in Guix yet it spawns the importer, writes to ~/.Rguix/packages.scm, puts that on GUIX_PACKAGE_PATH and installs from there.
<rekado_>it doesn’t yet do installs from git or hg repos, but that’s just a matter of looking at the argument. If it’s a URL then we try the hg or git “archive” for the CRAN importer.
<zimoun>ah yes, sorry I have overlooked the “import” part. :-)
***rekado_ is now known as rekado
<zimoun>it seems a good idea from a first look, especially as a first attempt. For the long term, it appears to me better to use “guix repl” and Scheme instead of “system(guix, c(<string>))”, as discussed for Emacs-Guix or Nyxt.