IRC channel logs

2021-06-17.log

<rekado_>hmm, there seems to be a problem after the migration
<rekado_>a user reports seeing invalid path errors:
<rekado_>guix install: error: path `/gnu/store/ba1a66jwfbv0rww2ixh2fr6n7m2xxavx-r-seurat-4.0.2' is not valid
*rekado_ is afraid
<rekado_>looks like the db and the store went out of sync
<rekado_>my guess is that the daemon was running without being managed by pacemaker, so any changes haven’t been synced
<rekado_>this high-availability setup is gnarly
<rekado_>seems fixed now, but I don’t understand this inherited setup well :-/
<civodul>hey rekado_
<civodul>what's pacemaker?
<civodul>something akin to Ansible?
<rekado_>no, it’s a high-availability cluster service
<rekado_>you specify nodes as part of a high-availability cluster, define resources, add constraints, and then the system aims to provide the defined resources under the given constraints.
<rekado_>in our case we have two nodes; guix-daemon is a managed resource constrained to only run on the first node; the NFS server is another managed resource that runs preferentially on the first node, but will run on the second node when the first one dies.
<rekado_>the IP address is also a shared resource.
<rekado_>etc
<rekado_>since there are resource constraints (e.g. guix-daemon may only run after we have mounted the cluster file system, and it may never run on the second node)
<rekado_>… certain services may only be started by pacemaker, not by systemd directly
<rekado_>so you have a *disabled* guix-daemon.service to ensure it won’t just get started when the system boots, so that pacemaker can start it (via systemd) ensuring that the constraints are not violated.
<rekado_>the whole point of all this is that /gnu/ and /var/guix/profiles/ should always be available on all clients; in the worst case they’ll be in read-only mode.
<rekado_>so we can take the first node down, upgrade it, change it, and the worst impact on users is that they can’t install new software (because they can’t talk to the daemon). Their cluster jobs that depend on /gnu to remain available will not be impacted.
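[Editor's sketch] The setup described above could look roughly like the following pcs commands. This is a minimal sketch under stated assumptions, not the configuration actually used: the node names (node1, node2), the resource names cluster-fs and cluster-ip, and the address 192.0.2.10 are hypothetical, cluster-fs is assumed to exist already, and pcs syntax differs between pacemaker versions. The commands are only printed unless DRY_RUN=0 is set.

```shell
# Print each command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# All node and resource names below are placeholders (assumptions).
define_cluster_resources() {
    # The unit stays disabled so that only pacemaker starts the daemon.
    run systemctl disable guix-daemon.service

    # Managed resources: the daemon, the NFS server, the shared IP address.
    run pcs resource create guix-daemon systemd:guix-daemon
    run pcs resource create nfs-server systemd:nfs-server
    run pcs resource create cluster-ip ocf:heartbeat:IPaddr2 ip=192.0.2.10 cidr_netmask=24

    # guix-daemon must never run on the second node.
    run pcs constraint location guix-daemon avoids node2=INFINITY
    # The NFS server prefers the first node but may fail over to the second.
    run pcs constraint location nfs-server prefers node1=100
    # guix-daemon may only start after the cluster file system is mounted.
    run pcs constraint order start cluster-fs then start guix-daemon
}
```

Calling define_cluster_resources with DRY_RUN unset prints the seven commands for review without touching the cluster.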
<rekado_>it’s not terribly complicated (because we don’t have that many resources that need managing), but:
<rekado_>a) all these resource definitions are done outside of the configuration management system (puppet) due to the version of pacemaker that we have to use on RHEL
<rekado_>and b) all the rest *is* done in puppet, which is stateful and so makes it really hard to anticipate how it’s going to fail
<rekado_>I’d *love* to do this declaratively with Guix System.
<civodul>i see, and i can sympathize :-)
<civodul>well, i haven't used puppet, but the combination you describe has a very stateful and error-prone feel
<rekado_>yeah, it’s everything I have tried to forget about since using Guix System.
<rekado_>I still have a bunch of locations that are considered invalid
<civodul>hmm fishy
<rekado_>yeah
<rekado_>not sure, but it seems that just “guix build /gnu/store/whatever” marks it valid.
<rekado_>so if I can get it to print me a list of invalid locations I can at least fix them without problems.
<rekado_>this must have been the result of having the db on the node go out of sync with the SAN storage under pacemaker management.
<rekado_>bah
<rekado_>I really don’t like to have to deal with this.
<rekado_>but… it’s faster.
<civodul>but it's likely a one-off issue due to the transition, somehow, no?
<rekado_>a one-off as a result of the few hours during which pacemaker was not in charge of the daemon, yes
<rekado_>hopefully :)
<rekado_>I’ll write out all valid paths from the db and compare them to what’s in /gnu/store
<rekado_>hmm, it says, for example, that /gnu/store/5rn829p02n0nja20a0ky4q9v6dfkzq61-rstudio-server-1.4.1717 is not valid, but the database says that it *is* valid
<rekado_>(it’s one of the values in the ValidPaths table)
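[Editor's sketch] The comparison mentioned above (valid paths from the database versus what is in /gnu/store) might be done along these lines. It is a sketch under assumptions: the function name compare_store_with_db is made up, and the list of valid paths is assumed to have been dumped to a text file beforehand, for instance from the ValidPaths table of the SQLite database under the daemon's localstatedir.

```shell
# Compare a list of valid paths with the entries actually present in the store.
#   $1: file with one valid store path per line, e.g. produced with
#       sqlite3 /var/guix/db/db.sqlite 'SELECT path FROM ValidPaths'
#       (database location assumed; it depends on localstatedir)
#   $2: store directory, defaults to /gnu/store
compare_store_with_db() {
    valid_list=$1
    store_dir=${2:-/gnu/store}
    tmpdir=$(mktemp -d) || return 1

    LC_ALL=C sort -u "$valid_list" > "$tmpdir/valid"
    # Top-level store entries only; hidden entries such as .links are skipped.
    find "$store_dir" -mindepth 1 -maxdepth 1 ! -name '.*' \
        | LC_ALL=C sort -u > "$tmpdir/on-disk"

    # In the store but unknown to the database.
    LC_ALL=C comm -13 "$tmpdir/valid" "$tmpdir/on-disk" | sed 's/^/not-in-db /'
    # Known to the database but missing from the store.
    LC_ALL=C comm -23 "$tmpdir/valid" "$tmpdir/on-disk" | sed 's/^/not-on-disk /'

    rm -rf "$tmpdir"
}
```

An empty result would mean the database and the store agree, which points the search away from the store itself and towards which database or daemon is being consulted.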
<efraim>objects missing in the store should be easier, 'guix gc --verify=repair,contents' normally takes care of those.
<efraim>As you've noted before though, things that are in the store but missing in the database suddenly disappear
<rekado_>it’s odd, though, that this thing exists in the store *and* the database, yet Guix on another server says it doesn’t exist
<rekado_>sorry, says it’s invalid
<rekado_>at some point they must be using different bookkeeping
<rekado_>I’m too tired to investigate this now and my baby got a fever :-/
<efraim>baking the substitute?
<efraim>sad baby :/
<rekado_>snotty mcsnotface
<rekado_>trying something like this now: echo /gnu/store/* | xargs -n 100 guix build
<rekado_>I have no way of knowing which store item will be considered invalid
<rekado_>the database on the node running guix-daemon says it’s fine, and it’s in the store
<rekado_>this is horrible
<rekado_>I’m playing with this one user’s profile; just trying to upgrade a handful of packages.
<rekado_>and one after the other claims to be invalid
<civodul>rekado_: the other node, where guix claims it's invalid, must be talking to a different daemon, no?
<civodul>is GUIX_DAEMON_SOCKET pointing to the right daemon?
<rekado_>oh my…
<rekado_>there is a /gnu/var/guix now
<rekado_>on the server where it should be using /var/guix
<rekado_>something must have remembered that old localstatedir and created it
<rekado_>moved it out of the way and created a symlink from /gnu/var/guix to /var/guix
<rekado_>and it seems to work now
<rekado_>previously I had a symlink from /gnu/var/guix/profiles to /var/guix/profiles
<rekado_>so… everything is actually okay, the “guix” commands were just discovering /gnu/var/guix/db and used *that* while the daemon happily updated /gnu/store.
<rekado_>lessons learned: don’t mess with localstatedir, and if you do you shouldn’t mess with it again.
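[Editor's sketch] A stray state directory like the /gnu/var/guix found above can be spotted by checking every candidate localstatedir for a database. The helper below is hypothetical (the name list_guix_dbs and the db/db.sqlite layout are assumptions); it defaults to the two locations named in the log.

```shell
# Report which candidate state directories hold a Guix database, and which
# are merely symlinks to another location.
# Arguments: candidate localstatedir values (default: /var and /gnu/var).
list_guix_dbs() {
    [ "$#" -gt 0 ] || set -- /var /gnu/var
    for dir in "$@"; do
        db="$dir/guix/db/db.sqlite"
        if [ -L "$dir/guix" ]; then
            # A symlink is fine: it points at the real state directory.
            echo "symlink $dir/guix -> $(readlink "$dir/guix")"
        elif [ -f "$db" ]; then
            echo "database $db"
        fi
    done
}
```

More than one "database" line in the output would suggest that two independent state directories exist on the same machine.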
<civodul>:-)
<civodul>glad you found the culprit!
<rekado_>this is, of course, a very unusual problem
<rekado_>but it has bothered me a lot in the past that there is this dichotomy between /gnu/store and /var/guix/db
<rekado_>I totally see the necessity of *having* the db
<rekado_>but: I wonder why this problem had to happen at all
<rekado_>why does “guix” need the db? Isn’t it enough for “guix-daemon” to have the db?
<civodul>rekado_: "guix" client commands don't access the db; they only talk to the daemon, which does it on their behalf
<civodul>that's why i was wondering earlier if it could be that they were talking to the wrong daemon
<rekado_>when the “guix” client is configured with a different localstatedir than the daemon, will it make the daemon use that other database?
<rekado_>PurpleSym: I just found a problem with my second patch to rstudio
<rekado_>that patch sets the R version to whatever was set in the “active” session
<rekado_>but it fails to look up the *user’s* session files and instead looks at the *server* user’s file – which may not even exist and does not contain any previously recorded session
<rekado_>I didn’t catch this earlier because I ran rserver as the same user that ran rsession
<rekado_>now I’m testing this on a server where it runs as root.
<rekado_>I’ll provide an updated patch some time next week