IRC channel logs

2020-12-11.log


<rekado_>with PiGx we bumped into a limitation of Snakemake that I’d like to address in GWL
<rekado_>it’s again in the context of running workflows on rented machines (aka cloud)
<rekado_>with AWS it’s easier to rent a really big node and run everything locally than to build up an HPC cluster
<rekado_>this means, however, that the most expensive parts of the workflow determine the size of the node.
<rekado_>this is very wasteful for long-running computations that consist of both expensive and cheap operations
<rekado_>so I was thinking of building a simple queueing system: have at least two nodes, one big and one small, and queue up jobs manually so that we fill the queues sensibly (I’m not even thinking about “optimally”)
<rekado_>start the big node only right before the first job is executed and terminate it right after the last job finishes.
<rekado_>this requires offline computation of the process graph, and it requires knowledge about the complexity of each step
<rekado_>finally, it requires the ability to impose an execution schedule on the workflow engine
<rekado_>Snakemake can compute the process graph, but it does not have a concept of work complexity (which may be a function of the size of the input data) nor does it allow for an execution schedule to be imposed
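
To make the offline-scheduling idea concrete, here is a toy sketch in Guile, with made-up process names, cost estimates, and a fixed big/small pair of nodes (none of this is GWL or Snakemake API): order the graph topologically and send anything above a cost threshold to the big node.

    ;; Toy offline schedule: the big node only needs to be running
    ;; between its first and last assigned job.
    (use-modules (srfi srfi-1))

    ;; Each entry is (name cost dependencies); the numbers are invented.
    (define processes
      '((trim        1 ())
        (align      10 (trim))
        (sort        2 (align))
        (call-peaks  8 (sort))
        (report      1 (call-peaks))))

    (define (topological-sort procs)
      "Return PROCS ordered such that dependencies come first."
      (let loop ((remaining procs) (done '()))
        (if (null? remaining)
            (reverse done)
            (let ((ready (filter (lambda (p)
                                   (every (lambda (dep) (assq dep done))
                                          (third p)))
                                 remaining)))
              (loop (lset-difference eq? remaining ready)
                    (append (reverse ready) done))))))

    (define (assign-node process)
      "Pick a node for PROCESS from its cost estimate alone."
      (if (> (second process) 5) 'big-node 'small-node))

    (define schedule
      (map (lambda (p) (cons (first p) (assign-node p)))
           (topological-sort processes)))
    ;; => ((trim . small-node) (align . big-node) (sort . small-node)
    ;;     (call-peaks . big-node) (report . small-node))
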
<rekado_>Now I wonder if offline scheduling is at all sensible, or if it would be better if the GWL gained the ability to schedule jobs at runtime based on actual completion feedback.
<rekado_>this gets dangerously close to implementing a really poor version of SGE or slurm
<rekado_>ignoring the cloud for a moment, though, I think it would be a useful feature.
<rekado_>say you have three idle machines in your network. You have SSH access to them and they have a shared filesystem to exchange data.
<rekado_>you don’t have a proper HPC setup, no cluster scheduler.
<rekado_>it sounds like a good idea to me to tell the GWL about these machines, have the GWL query the most basic resources (number of cores, amount of RAM), and then line up jobs for these three queues.
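
As a rough illustration of that “query the most basic resources” step, something as small as this would already do, assuming passwordless SSH access and placeholder host names:

    ;; Probe the core count and total RAM of a remote machine over SSH.
    (use-modules (ice-9 popen) (ice-9 rdelim))

    (define (remote-line host command)
      "Run COMMAND on HOST via ssh and return the first line of output."
      (let* ((port (open-input-pipe (string-append "ssh " host " " command)))
             (line (read-line port)))
        (close-pipe port)
        line))

    (define (node-resources host)
      "Return an alist with HOST's number of cores and total RAM in kB."
      `((host  . ,host)
        (cores . ,(string->number (remote-line host "nproc")))
        ;; The MemTotal line looks like "MemTotal:  16308856 kB".
        (ram   . ,(string->number
                   (car (string-tokenize
                         (remote-line host "grep MemTotal: /proc/meminfo")
                         char-set:digit))))))

    (define nodes (map node-resources '("node1" "node2" "node3")))
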
<civodul>rekado_: sounds interesting
<civodul>scheduling is a research topic in itself, so it's a bit of a can of worms
<civodul>the good thing is that there are probably things to borrow from existing work
<rekado_>yeah, that’s what I’m fearing
<rekado_>I’d be happy if there was some generic library I could include that would take care of this…
<rekado_>I don’t really want to have to become an expert in all things scheduling first
<civodul>you could instead focus on how to architect the code such that you can plug in different scheduling strategies
<civodul>in GWL the whole graph is known before run time, right?
<rekado_>yes
<rekado_>the graph is absolutely primitive right now. It’s only input/output directed.
<civodul>you could first allow for pluggable static schedulers
<civodul>when both available resources and tasks are known in advance
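
One way to keep the strategy pluggable without committing to any particular algorithm: treat a scheduler as nothing more than a procedure from tasks and nodes to an assignment. A sketch, not existing GWL code; tasks are just (name . cost) pairs here:

    (use-modules (srfi srfi-1))

    ;; Two interchangeable static strategies with the same signature.
    (define (round-robin-scheduler tasks nodes)
      "Assign TASKS to NODES in turn, ignoring costs."
      (map (lambda (task index)
             (cons (car task) (list-ref nodes (modulo index (length nodes)))))
           tasks (iota (length tasks))))

    (define (greedy-scheduler tasks nodes)
      "Assign the most expensive tasks first, always to the least loaded node."
      (let ((load (map (lambda (node) (cons node 0)) nodes)))
        (map (lambda (task)
               (let ((target (car (sort load (lambda (a b) (< (cdr a) (cdr b)))))))
                 (set-cdr! target (+ (cdr target) (cdr task)))
                 (cons (car task) (car target))))
             (sort tasks (lambda (a b) (> (cdr a) (cdr b)))))))

    ;; The call site does not care which strategy is plugged in:
    (define (make-schedule strategy tasks nodes)
      (strategy tasks nodes))

    (make-schedule greedy-scheduler
                   '((align . 10) (trim . 1) (report . 1))
                   '(big-node small-node))
    ;; => e.g. ((align . big-node) (trim . small-node) (report . small-node))
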
*rekado_ nods
<rekado_>the GWL does allow for resource specifications to be attached to each process; not sure if it makes sense to allow specifications that are functions of the input data size.
<rekado_>the complication here is that the weight of each node would only become known at runtime
<rekado_>and that’s no longer compatible with a static schedule
<civodul>unless you do a dry run to gather task data
<civodul>like profile-guided scheduling
<rekado_>the inputs of some processes are only available after completion of other processes, so a dry run would not provide the required information.
<rekado_>I stumbled upon this: http://librcps.org; I wonder if there are more libraries like that.
<rekado_>(objectionable use of slavery as an illustration…)
<rekado_>I think I’ll need to see more implementations to know what I would need to change to make scheduling strategies pluggable
<rekado_>I’m reading about DIET right now
<zimoun>rekado_: scheduling in the GWL sounds good to me.
<zimoun>one point is about data transfer though
<zimoun>rekado_, civodul: scheduling is an art, but a topological graph sort based on complexity seems simple enough to be doable and should be enough for GWL applications.
<rekado_>zimoun: I’d sidestep the problem of data locality completely.
<zimoun>rekado_: I do not like genetic algorithms, as in librcps; it sounds to me like “we try almost randomly to find the best”.
<zimoun>yeah, I understand; it’s just that it is a parameter in the AWS context, I guess.
<rekado_>there are two approaches I’m okay with: a) ignore the problem and assume shared file systems, or b) give workflow authors the ability to “group” processes to specify a preference to execute them on the same location
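
Option (b) could be as small as one extra piece of metadata that the scheduler consults. A sketch with a made-up “group” tag, which is not part of today’s GWL:

    (use-modules (srfi srfi-1))

    ;; Processes sharing a group are pinned to the same node so that
    ;; their intermediate files never leave that machine.
    ;; Entries are (name group cost); group may be #f for "don't care".
    (define tasks
      '((align      mapping 10)
        (sort       mapping  2)   ; grouped with align
        (call-peaks #f        8)
        (report     #f        1)))

    (define (schedule-with-groups tasks pick-node)
      "Assign a node to each task, reusing the node already chosen for its group."
      (let loop ((tasks tasks) (group-nodes '()) (result '()))
        (if (null? tasks)
            (reverse result)
            (let* ((task  (car tasks))
                   (group (second task))
                   (node  (or (and group (assq-ref group-nodes group))
                              (pick-node task))))
              (loop (cdr tasks)
                    (if group (acons group node group-nodes) group-nodes)
                    (acons (first task) node result))))))

    ;; PICK-NODE could be any strategy; here a trivial cost threshold.
    (schedule-with-groups tasks
                          (lambda (task) (if (> (third task) 5) 'big 'small)))
    ;; => ((align . big) (sort . big) (call-peaks . big) (report . small))
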
<rekado_>re librcps: yes, whenever someone uses genetic algorithms exclusively it always sounds to me as if they only just discovered this cool new trick :)
<rekado_>I just stumbled upon it while searching for libraries
<rekado_>this space is surprisingly underdeveloped
<zimoun>about data transfer, I agree; that was just to be clear about the bounds of the problem. :-)
<zimoun>about librcps, I have not read the doc, though. Just the front page :-)
<rekado_>what you will find *a lot* of is full-blown cluster schedulers, Cron-like stuff, general-purpose monoliths like Apache Airflow, and some “Kubernetes is cool” stuff.
<rekado_>the documentation is only 13 pages; it’s like the front page, but uses more words and includes an API reference.
<zimoun>repo using svn is not appealing ;-)
<rekado_>could be worse (aka CVS)
<zimoun>hehe
<zimoun>have you tried to clone?
<zimoun>I get “svn: E170013: Unable to connect” from “svn co https://www.librcps.org/svn/librcps librcps”
<rekado_>no, I only read the documentation
<rekado_>haven’t looked at the actual code
<zimoun>well, I do not know if the code is really available via SVN; giving a look at the tarball
<zimoun>rekado_: it is a small C codebase. I do not know if it is worth using it instead of implementing something. I will think a bit about the problem: the least effort to get scheduling in GWL that remains maintainable.
<rekado_>what we do so far is very simple: we compute process dependencies and if the selected engine is Grid Engine we generate job scripts that declare other jobs as dependencies (by name)
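
For reference, the Grid Engine side of that comes down to job names and hold lists; roughly this kind of command line (process names invented, but -N and -hold_jid are real qsub options):

    (define (qsub-command name dependencies script)
      "Return a qsub call that runs SCRIPT as NAME once DEPENDENCIES finished."
      (string-append "qsub -N " name
                     (if (null? dependencies)
                         ""
                         (string-append " -hold_jid "
                                        (string-join dependencies ",")))
                     " " script))

    (qsub-command "call-peaks" '("align" "sort") "call-peaks.sh")
    ;; => "qsub -N call-peaks -hold_jid align,sort call-peaks.sh"
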
<rekado_>I want GWL to have a runner that keeps track of state (jobs yet to execute, jobs currently running, jobs that have failed, etc), and that can trigger the execution of job scripts — locally or remotely
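
A rough shape for that runner state, with no claim that it matches any existing GWL code; jobs sit in a handful of buckets and a step procedure launches whatever has become ready:

    (use-modules (srfi srfi-1) (srfi srfi-9))

    ;; How a job is launched (locally, over SSH, via qsub) is kept out of
    ;; the state record; jobs are (name . dependencies) pairs here.
    (define-record-type <runner>
      (make-runner pending running finished failed launch)
      runner?
      (pending  runner-pending  set-runner-pending!)
      (running  runner-running  set-runner-running!)
      (finished runner-finished set-runner-finished!)
      (failed   runner-failed   set-runner-failed!)
      (launch   runner-launch))

    (define (ready? job runner)
      "A job is ready when all of its dependencies have finished."
      (every (lambda (dep) (member dep (runner-finished runner)))
             (cdr job)))

    (define (step! runner)
      "Launch every pending job whose dependencies are satisfied."
      (let ((ready (filter (lambda (job) (ready? job runner))
                           (runner-pending runner))))
        (for-each (runner-launch runner) ready)
        (set-runner-pending! runner
                             (lset-difference equal?
                                              (runner-pending runner) ready))
        (set-runner-running! runner
                             (append ready (runner-running runner)))))
    ;; Moving jobs from RUNNING to FINISHED or FAILED on completion
    ;; works the same way.
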
<zimoun>I think I get your points and from my understanding I agree. What is not clear to me is the correct level. Currently, GWL is doing more or less the same as Snakemake, maybe a bit less :-), with the limitations you described. It is not clear how to have the features you are describing without re-implementing Cuirass or yet another grid engine.
<rekado_>I think it’s fine to implement a teeny subset of grid engine
<rekado_>currently the GWL is rather useless if you don’t already have an SGE setup
<rekado_>you can run workflows *locally* as an alternative, but that’s really limiting.
<zimoun>I agree, and one key point of your argument is AWS, at least IMHO. :-)
<rekado_>to me it comes back to the “SSH backend” I wanted for the GWL
<rekado_>AWS is just a special case of that
<rekado_>I don’t really want to elevate AWS by giving it priority treatment in the GWL
<rekado_>I don’t want to use their fancy services any more than necessary to allow for simple remote execution
<rekado_>to me this is always the same kind of setup: a number of nodes, shared storage, queues.
<rekado_>on AWS this is EC2 nodes, EFS for shared storage — and the queues must then be tracked locally by GWL.
<rekado_>it’s easy to conceptually remove AWS: replace EC2 with physical machines which can be accessed via SSH, and swap EFS for NFS.
<rekado_>a case in between is a local “cloud”: there needs to be some initial configuration to spin up VMs, but ultimately it’s just the same as the other two cases.
<zimoun>I totally agree. What I meant was AWS is a good use case and easy to communicate about.
<rekado_>zimoun: what do you think of this abomination: https://elephly.net/paste/1607701181.R.html
<rekado_>load it in R, then do guix.install("deaR")
<rekado_>deaR is a package we don’t have in Guix yet
<rekado_>then library(deaR)
<rekado_>this one looks better, sorry: https://elephly.net/paste/1607701370.R.html
<zimoun>rekado_: I am going to give a look after our Outreachy weekly meeting. :-)
<zimoun>the point is to be able to have coexistence of R packages, from R land and from Guix, right?
<rekado_>zimoun: it’s coexistence of R from Guix and R from Guix, really
<rekado_>it lets you install R packages via Guix from within a running R session
<rekado_>you don’t have to use the Guix names for the packages as they are renamed automatically
<rekado_>if a package isn’t available in Guix yet it spawns the importer, writes to ~/.Rguix/packages.scm, puts that on GUIX_PACKAGE_PATH and installs from there.
<rekado_>it doesn’t yet do installs from git or hg repos, but that’s just a matter of looking at the argument. If it’s a URL then we try the hg or git “archive” for the CRAN importer.
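
To make that mechanism concrete: the generated ~/.Rguix/packages.scm only has to be a Guile module exporting an ordinary Guix package, roughly what “guix import cran” prints; with GUIX_PACKAGE_PATH pointing at that directory, the package becomes installable as “r-dear” the usual way. Version, hash, inputs, synopsis, and license below are placeholders, not the real deaR metadata:

    ;; ~/.Rguix/packages.scm (sketch; field values are placeholders)
    (define-module (packages)
      #:use-module (guix packages)
      #:use-module (guix download)
      #:use-module (guix build-system r)
      #:use-module ((guix licenses) #:prefix license:))

    (define-public r-dear
      (package
        (name "r-dear")
        (version "0.0.0")                 ; placeholder
        (source (origin
                  (method url-fetch)
                  (uri (cran-uri "deaR" version))
                  (sha256
                   (base32
                    "0000000000000000000000000000000000000000000000000000"))))
        (properties `((upstream-name . "deaR")))
        (build-system r-build-system)
        (home-page "https://cran.r-project.org/package=deaR")
        (synopsis "Placeholder synopsis")
        (description "Placeholder description.")
        (license license:gpl3)))          ; placeholder
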
<zimoun>ah yes, sorry I have overlooked the “import” part. :-)
***rekado_ is now known as rekado
<zimoun>it seems a good idea. From a first look, it seems good, especially as a first attempt. For the long term, it appears to me better to use “guix repl” and Scheme instead of “system(guix, c(<string>))”, as discussed with Emacs-Guix or Nyxt.
<rekado>yes, probably
<rekado>though I can’t think of a real downside yet
<rekado>because this feature is much poorer than what Emacs-Guix attempts to be
<zimoun>I have to go, meeting. Discuss that later. :-)
<rekado>good luck!
<zimoun>rekado: thanks :-)
<zimoun>guix.install looks good.