IRC channel logs

2021-04-25.log

***hitchcock.freenode.net sets mode: +o ChanServ
<rekado_>so…
<rekado_>GWL is still slow, but much faster than before, so I guess I should move on to adding DRMAA support.
<rekado_>next step is to build a test cluster; I’m thinking of using Guix to build VMs that together form a simple two-node slurm cluster.
***hitchcock.freenode.net sets mode: +o ChanServ
<rekado_>I got a test cluster now, and I can submit jobs with srun, but slurm-drmaa isn’t working properly.
<rekado_>when I let Guile DRMAA connect to the cluster I get this error:
<rekado_>guile: error: plugin_load_from_file: dlopen(/gnu/store/4lnskfwx6s7k8d3gr3mdq1p27z784m1r-slurm-20.11.3/lib/slurm/auth_munge.so): /gnu/store/4lnskfwx6s7k8d3gr3mdq1p27z784m1r-slurm-20.11.3/lib/slurm/auth_munge.so: undefined symbol: slurm_conf
<rekado_>I remember this bug; that’s when I gave up last time.
<civodul>does auth_munge.so have libslurm.so in NEEDED? in RUNPATH?
<rekado_>no, that’s likely the problem here
<civodul>could be a miscompilation issue that's hidden when using dlopen with RTLD_GLOBAL
<civodul>yeah
<rekado_>I suppose we’d need to patch it to link with libslurm
<civodul>prolly, adding -lslurm in the right place
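[Editor's note: the NEEDED/RUNPATH check civodul asks about can be scripted. The following is a minimal sketch, assuming binutils' `readelf` is on PATH; the helper names `parse_dynamic_section` and `links_against` and the `libslurm` prefix are illustrative and do not come from the log.]

```python
import re
import subprocess

# Matches lines such as:
#  0x0000000000000001 (NEEDED)   Shared library: [libc.so.6]
#  0x000000000000001d (RUNPATH)  Library runpath: [/some/dir]
_DYN_ENTRY = re.compile(r"\((NEEDED|RUNPATH|RPATH)\)[^\[]*\[([^\]]*)\]")


def parse_dynamic_section(readelf_output):
    """Collect NEEDED, RUNPATH and RPATH values from `readelf -d` text."""
    entries = {"NEEDED": [], "RUNPATH": [], "RPATH": []}
    for line in readelf_output.splitlines():
        match = _DYN_ENTRY.search(line)
        if match:
            tag, value = match.groups()
            entries[tag].append(value)
    return entries


def links_against(path, library_prefix="libslurm"):
    """Return True if the shared object at `path` lists a NEEDED entry
    whose name starts with `library_prefix` (an assumed library name)."""
    output = subprocess.run(
        ["readelf", "-d", path],
        check=True, capture_output=True, text=True,
    ).stdout
    needed = parse_dynamic_section(output)["NEEDED"]
    return any(name.startswith(library_prefix) for name in needed)
```

[If `links_against` returns False for a plugin, the plugin relies on whoever loads it to have the library's symbols already visible, which matches the "undefined symbol" failure seen above.]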
***hitchcock.freenode.net sets mode: +o ChanServ
<rekado_>got past that, but there are more libraries that need linker changes
<rekado_>nothing insurmountable
<efraim>I'm around but on my phone. So far we've been using Debian's slurm and munge
<rekado_>efraim: ah, I was hoping you had a working configuration
<rekado_>but that’s okay, I think I’m close to having something that works well enough for my testing purposes
<zimoun`>oh, neat! 2 Guix VM cluster with Slurm? I would be interested to see your config files. ;-)
<rekado_>I tried to run everything as a regular user, but it seems that munged needs to run as root
<rekado_>seems to work now
<rekado_>I can submit jobs via guile-drmaa and control them
<zimoun`>neat!
<rekado_>but the wait call doesn’t seem to work
<rekado_>but that’s exactly why I set up this test environment
<rekado_>BTW: no VMs. Just slurmd, slurmctld, and munged in a Guix environment.
<zimoun`>ok, it could be cool to be able to deploy a toy cluster with various “guix system vm”. :-)
<rekado_>yes, I agree
<rekado_>it’s not my primary goal with this experiment, though. I just want something quick to test guile-drmaa and implement the drmaa backend for the GWL.
<rekado_>I’m a little worried about the fact that the DRMAA “wait” call is a great way to wait forever.
<zimoun`>yeah, I understand. :-) It is already more than enough to test GWL and guile-drmaa.
<rekado_>I submitted two very short-lived jobs on hold, then released one of them and immediately after that issued the wait call.
<rekado_>the job seems to have completed immediately, perhaps even before the wait call was received.
<rekado_>so now I’m waiting for any job to complete, but there are none, because one is on hold and the other is already done.
<zimoun`>Ah, and slurm & friends are correctly set up?
<rekado_>yes, seems so
<rekado_>slurmctld also confirms that the jobs have been submitted on hold, that one has been released, and that one has completed.
<rekado_>I guess the safest way is to run a wait loop in a separate thread and only then use the “control” call to release jobs from hold
<rekado_>you can set up a timeout, but I wouldn’t want to miss any jobs
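[Editor's note: the pattern rekado_ describes (start the wait loop in its own thread, release held jobs only afterwards, and treat a timeout as "retry" rather than "give up") can be sketched as below. This is not guile-drmaa or any real DRMAA binding: `wait_any`, `release`, and the queue-backed stand-ins are hypothetical, and a real DRMAA implementation may behave differently.]

```python
import queue
import threading


def collect_completions(wait_any, expected, timeout=1.0):
    """Call wait_any(timeout) until every job id in `expected` is reported.

    A TimeoutError only means nothing finished during this interval,
    so the loop retries instead of dropping the remaining jobs.
    """
    pending = set(expected)
    finished = {}
    while pending:
        try:
            job_id, exit_status = wait_any(timeout)
        except TimeoutError:
            continue
        pending.discard(job_id)
        finished[job_id] = exit_status
    return finished


def run_with_waiter(wait_any, release, job_ids):
    """Start the waiter thread first, then release the held jobs."""
    results = {}
    waiter = threading.Thread(
        target=lambda: results.update(collect_completions(wait_any, job_ids))
    )
    waiter.start()              # the thread is started before any release
    for job_id in job_ids:
        release(job_id)
    waiter.join()
    return results


# Hypothetical stand-ins for a scheduler: releasing a job "completes" it
# at once with exit status 0.
_done = queue.Queue()


def fake_release(job_id):
    _done.put((job_id, 0))


def fake_wait_any(timeout):
    try:
        return _done.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError from None
```

[Tracking the set of expected job ids is what keeps a finite timeout from losing jobs: the loop ends only when each one has been accounted for.]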
*civodul finds this anagram of "drama" quite disturbing
<rekado_>they had a perfect opportunity to get the name right — and chose not to.
<rekado_>it’s one of many disappointments with DRMAA
<civodul>:-)
<zimoun`>civodul: any idea about the release date? v1.3 right? I am asking for writing the hpc.guix.info part.
<civodul>zimoun`: at first sight, at best within a week
<civodul>great that you're volunteering for the blog!
<civodul>i think apteryx was going to put out an RC today or tomorrow
<zimoun`>ok, and apteryx is also updating the NEWS / ChangeLog ?
<civodul>not sure
<civodul>i may take a look at NEWS tomorrow or so
<civodul>would you like to help?
<zimoun`>Sorry, not really. I'm taking on the science part, not yet Guix proper. :-)
<zimoun`>BTW, as I proposed, somehow, this NEWS file (or another maybe?) should be updated by committers each time they push relevant things. At least, a strong encouragement to do so. :-) It would ease this “boring” task.
<civodul>yes, i agree, but somehow we again failed to do that
<zimoun`>well, I have a tiny Org file where I tracked a few items. I thought it would be easier to collect them as they come. Somehow, there are too many things to be done. Just as committers are encouraged to review, they should be encouraged to update “something” when they push. I guess Emacs does it that way.
<zimoun`>bah, let’s discuss that after the release. ;-)
<civodul>yup!
<rekado_>slurm + guile-drmaa works after all; even though slurmctld said the job was completed it was never actually processed because my only node was marked as “down”. Fixed it and now “wait” is working just fine.
<rekado_>I misread the output; it just informed me that the “update job” RPC was completed, not the job itself
<rekado_>it prints “_job_complete” when the job is complete, along with the exit status