IRC channel logs

<rekado>I finally upgraded our guix-daemon at the MDC and things appear to be working fine

<rekado>downloads from ci.guix.gnu.org are *terribly* slow!

<rekado>I get something like 350kB/s

<rekado>it starts out at 2.2MB/s and then drops

<rekado>downloading the thing over HTTP with wget I see 429MB/s downloads

<rekado>here’s my shell session: https://elephly.net/paste/1618735370.html

<rekado>“guix download https://ci.guix.gnu.org/nar/lzip/l0jsrzgxdcghb87gly3zc5mvr9ylxi8g-0ad-data-0.0.23b-alpha” is very fast.

<efraim>no idea. All I can say is I too have some phantom download speed differences over the local network between rsync and guix-publish.

<rekado>I see in the “guix build” strace log that it stats individual target files and then writes those

<rekado>this might well be the source of slowdown from what I can tell at this point

<rekado>for every file in the archive it has a bunch of stat, link, chmod, rename, seek, etc calls

<rekado>and I bet these are not cheap on NFS

<rekado>I wonder if it’s due to this patch set: https://issues.guix.gnu.org/45253

<rekado>in any case, I’ll revert back to the commit before c7c7f068c15e419aaf5ef616516aa5ad4e55c2fa and see if it changes anything

<rekado>(trying commit f6f6e1efeecd553c3af4c31695b17fb69849967b)

<rekado>no change :-/

<rekado>trying aba8def46d392b3ef2278d16a2c9708fab05c6fd now

<rekado>also no change

<rekado>going back to the commit that we used before the upgrade

<rekado>(but from a git checkout, not “guix pull”)

<rekado>so… down at commit f41378b797574dc550a700d12cb1752497940496 the download is fast

<rekado>this is an apples vs horses comparison, though

<rekado>the old daemon downloads not the substitute for 0ad-data, but the sources, and it doesn’t download the lzip substitute for the sources but the gzip substitute.

<rekado>argh!!

<rekado>I think I may have lost data due to Guix deleting stuff from /gnu/store because it used the wrong db

<rekado>shit shit shit

<rekado>the file /gnu/store/fa6wj5bxkj5ll1d7292a70knmyl7a0cr-glibc-2.31/lib/ld-linux-x86-64.so.2 no longer exists

<rekado>oh fuck

<rekado>why does the daemon do that? It shouldn’t delete things from /gnu/store unless I use “guix gc”.

<rekado>just deleting stuff on “guix pull” is wrong.

<efraim>I feel like the db & store should be more strongly divorced from the rest of Guix, it seems like it'd be too easy to run bootstrap && configure --localstatedir=/tmp && make && guix gc

<civodul>rekado: re download speed, it could be simply that the lzip 0ad-data download is CPU-bound

<civodul>oh, i hope you can recover the store

<civodul>:-/

<rekado>problem is I really don’t know how much was deleted

<rekado>it doesn’t *say* that it’s deleting something

<rekado>I only noticed when it *failed* to delete something because the directory wasn’t empty

<rekado>it was about to delete that glibc directory

<rekado>and failed to delete all locales

<rekado>i recovered the glibc directory from zfs snapshots

<rekado>but … I don’t know how much else is borked now

<efraim>time for guix gc --verify=repair,contents ?

<rekado>I’m so scared of running “guix gc”

<rekado>is it going to decide that things it can’t access are free to collect?

<rekado>because it won’t be able to access all user home directories

<rekado>I’m really scared of it deciding to purge a few things without asking

<efraim>(according to the database) if something is supposed to be there it will get it back by downloading or building, if it's not supposed to be there I don't know

<efraim>it might not even see it if it's unknown to the DB

<rekado>in my past invocations of “guix gc” it would start by deleting links from the profiles directory

<efraim>IIRC those are the ones that were already unlinked with 'guix package -d'

<rekado>or those that the daemon *thinks* have been unlinked because it can’t see the source

<efraim>then I suppose you should make sure you're running the correct daemon with the correct DB

<rekado>the daemon sits on a separate machine, so it won’t be able to see the source files for links to the profile directory if those files are on other machines

*efraim thinks *suppose* isn't strong enough

<rekado>I *am* running the correct daemon though.

<rekado>what failed me is to think that “guix pull” with a channels commit in 2019 would properly inherit localstatedir (which is /gnu/var/guix, not /var/guix)

<rekado>this has worked for all the other daemons I tried, but it did not work for the daemon from commit f41378b797574dc550a700d12cb1752497940496

<rekado> — back to the download problems though

<rekado>there is a very big difference between “guix download” and “guix build”

<rekado>both download the same file, but “guix download” is really fast and “guix build” is unacceptably slow

<rekado>I can’t call this upgrade complete when download speed has dropped to 160kB/s

<rekado>(with the build farm sitting in the neighboring rack even)

<rekado>do you have any tricks to debug this?

<rekado>the CPU is pretty idle while downloading

<rekado>(with guix build)

<rekado>it currently sits at 244KiB/s and progress appears to have stalled completely (no more than 1%)

<rekado>now it’s 84KiB/s

<rekado>but I think it just reflects the fact that no progress is happening

<rekado>and now it aborted with “1.3%guix substitute: error: TLS error in procedure 'read_from_session_record_port': Error decoding the received TLS packet.”

<efraim>it's not something that I've figured out :(

<rekado>I’ll take it to #guix :)

<civodul>ah, the dreaded TLS error

<rekado>I can reproduce this reliably with “guix build 0ad-data” — but only on this system that is hamstrung by NFS

<rekado>it’s slowish on my laptop, but it does go beyond 1%

*civodul runs that command

<civodul>rekado: could you check the guix-daemon version you're running?

<civodul>there were serious bugs related to connection reuse at one point

<rekado>I’m currently using the daemon from commit 84feaca4888c9916e1a97bb81e5d157673488b70

<rekado>that’s very recent

<rekado>I tried f6f6e1efeecd553c3af4c31695b17fb69849967b, aba8def46d392b3ef2278d16a2c9708fab05c6fd, and f41378b797574dc550a700d12cb1752497940496

<rekado>the last one is fine, but it led to data loss when I ran “guix pull” (to switch to the more recent 7033c7692ccbbbad8f7b9952015de071a5588e87)

<civodul>what's the store file name when you do "ps aux| grep guix-daemon"?

*civodul just completed the 0ad-data substitute download

<rekado>it’s frustrating, but I don’t see the store file name

<rekado>I start the daemon with this simple wrapper: https://elephly.net/paste/1618750010.html

<civodul>once you have the PID of guix-daemon, you can do: ls -l /proc/PID/exe

<rekado>ah, nice: /gnu/store/3rw6dzs5rhhv079fi00ybly9bp81q7ci-guix-daemon-1.2.0-21.4dff6ec/bin/guix-daemon

<rekado>I only ever used /proc/PID/cmdline

<civodul>thanks

<civodul>that's the latest one... not good

<civodul>is it 100% reproducible, the 0ad-data thing?

<rekado>yes

<rekado>on this NFS setup

<civodul>nice

<civodul>could you "strace -s 800 -p PID -o log", where PID is the daemon's PID, and the "guix build 0ad-data"?

<civodul>er

<civodul>"strace -f -s 800 -p PID -o log"

<civodul>s/and the/and then/

<rekado>done that before, so I could see a bunch of lstat, stat, chmod, etc calls

<rekado>is there something in particular that you’d like me to look at?

<civodul>the goal is to see how the connection to ci.guix.gnu.org terminates

<civodul>that's the only substitute server, right?

<rekado>correct

<rekado>okay, let me run the command again

<civodul>could you (1) locate the FD number for the ci.guix connection in the 'guix substitute' process, and (2) starting from the end of the log, locate the last recvmsg(2) or sendmsg(2) call on that socket?
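
[Editor's sketch] The second step civodul asks for could be scripted roughly as below. This is a minimal Python sketch, assuming the log was written with "strace -f ... -o log" so that every line starts with a PID column. The function name last_socket_call and the sample lines are made up for illustration; they are not taken from rekado's pastes.

```python
import re

# Lines written by "strace -f -o log" start with the PID, optionally followed
# by a wall-clock timestamp when -t is used, then the system call, e.g.:
#   4242 recvfrom(15, "...", 16408, 0, NULL, NULL) = 16408
SOCKET_CALL = re.compile(
    r"^(?P<pid>\d+)\s+"
    r"(?:\d\d:\d\d:\d\d(?:\.\d+)?\s+)?"
    r"(?P<call>recvmsg|sendmsg|recvfrom|sendto)\((?P<fd>\d+),"
)


def last_socket_call(log_lines, fd):
    """Return the last send/receive call on file descriptor `fd`, or None."""
    last = None
    for line in log_lines:
        match = SOCKET_CALL.match(line)
        if match and int(match.group("fd")) == fd:
            last = line.rstrip("\n")
    return last


# Invented sample lines in the shape strace produces (not real log data).
sample = [
    '4242 connect(15, {sa_family=AF_INET, sin_port=htons(443)}, 16) = 0',
    '4242 recvfrom(15, "\\27\\3\\3@\\30", 5, 0, NULL, NULL) = 5',
    '4242 recvfrom(15, "..."..., 16408, 0, NULL, NULL) = 16408',
    '4242 openat(AT_FDCWD, "/gnu/store/example", O_RDONLY) = 16',
    '4242 recvfrom(15, "..."..., 16408, 0, NULL, NULL) = 2706',
    '4242 recvfrom(7, "x", 1, 0, NULL, NULL) = 1',
]

print(last_socket_call(sample, 15))
```

On a real log one would pass the open file object instead of the sample list, with the FD number found in step (1). Calls split by strace into "unfinished" and "resumed" halves are matched only by their first half here.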

<rekado>here’s the last recvmsg calls in the log: https://elephly.net/paste/1618750669.html

<rekado>in between the receives I see this sort of thing: https://elephly.net/paste/1618750732.html

<rekado>these are the now inlined checks and dedupe actions, IIUC

<civodul>right

<civodul>it looks like premature EOF, where the server drops the connection "too early"

<civodul>can you reproduce with: echo "substitute /gnu/store/l0jsrzgxdcghb87gly3zc5mvr9ylxi8g-0ad-data-0.0.23b-alpha /tmp/0ad" | GUIX_ALLOW_UNAUTHENTICATED_SUBSTITUTES=yes guix substitute --substitute ?

<civodul>if yes, then try the same with ./pre-inst-env from a checkout

<civodul>if it's still reproducible, then uncomment the TLS debugging statements in (guix build download)

<rekado>no, this one works

<rekado>and it’s faster

<civodul>wait, using the same 'guix'?

<rekado>yes

<civodul>you can look for "execve.*guix.*substitute" in your previous strace log, to make sure guix-daemon really is spawning the same 'guix'

<rekado>AFAICT it’s the same command

<rekado>it’s /gnu/store/r8xb0r8vdvhyrjm61yih27gkxl2p8f6k-guix-command

<rekado>I’m really digging this substitute command

*rekado tries to remember it for future debugging

<civodul>can you try the "guix substitute" command above until it fails? :-)

<civodul>if this one never fails, and the other one always fails, we'll have to play "spot the differences"

<rekado>sure!

<rekado>thank you so much for helping!

<rekado>I can’t get it to fail. Looking at the output of env I don’t see anything that would affect this.

<rekado>does the substitute command run under a different user account?

<rekado>(my tests here are done with the root account)

<civodul>no, it's started by guix-daemon and thus runs as root as well

<civodul>ah wait

<civodul>one thing that can make a difference is _NIX_OPTIONS

<rekado>that variable is not set here nor does the daemon’s /proc/PID/environ file mention it.
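
[Editor's sketch] The check rekado describes, looking for a variable in the daemon's /proc/PID/environ, can be done along these lines. This is a small Python sketch, assuming a Linux /proc filesystem; the helper names parse_environ and daemon_environ are hypothetical. The file holds NUL-separated NAME=VALUE entries and reflects the environment the process was started with.

```python
def parse_environ(raw: bytes) -> dict:
    """Turn the NUL-separated bytes of /proc/PID/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue  # skip the empty piece after the trailing NUL
        name, sep, value = entry.partition(b"=")
        if sep:  # ignore malformed entries without "="
            env[name.decode(errors="replace")] = value.decode(errors="replace")
    return env


def daemon_environ(pid: int) -> dict:
    """Read the environment of process `pid` (needs permission to read it)."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        return parse_environ(f.read())
```

With the daemon's PID in hand, "_NIX_OPTIONS" in daemon_environ(pid) answers the question directly.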

<rekado>I’m now getting the daemon log for when it’s working. Maybe I can spot differences.

<rekado>the output is very different because when writing to /tmp/0ad there are no existing files to deduplicate

<rekado>but in any case: in the broken case it receives much less data before aborting

<rekado>in the good case it uses “read” and not “recvfrom”

<rekado>the chunks that it reads are bigger than those that are received with “recvfrom”

<rekado>“read” reads blocks of 65537 bytes

<rekado>“recvfrom” receives chunks of 16408 bytes

<rekado>first five bytes (always the same) followed by a chunk of 16408

<rekado>so… is there some indirection at fault here?

<civodul>hmmm

<civodul>i think GnuTLS uses recvmsg(2), not read(2)

<civodul>oh wait, maybe "guix substitute" run by hand defaults to http, not https

<civodul>so you would need:

<civodul>echo "substitute /gnu/store/l0jsrzgxdcghb87gly3zc5mvr9ylxi8g-0ad-data-0.0.23b-alpha /tmp/0ad" | GUIX_ALLOW_UNAUTHENTICATED_SUBSTITUTES=yes _NIX_OPTIONS=substitute-urls=https://ci.guix.gnu.org guix substitute --substitute

<civodul>rekado: does this one reproduce the issue?

<civodul>we should make it easier to test

<rekado>nope, this one works too

<rekado>I’ll record a strace log and see if it looks closer to the failing case

<rekado>the trace of the working invocation has no store interactions, but other than that it looks very similar to the broken one.

<rekado>I can clearly see some sort of 5 byte TLS header followed by 16408 bytes of data
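
[Editor's sketch] The 5-byte prefix rekado sees matches the layout of a TLS record header: one byte of content type, two bytes of protocol version, and a 16-bit big-endian length of the data that follows. Below is a minimal Python sketch for decoding such a header. The function name is hypothetical and the example bytes are chosen by hand to declare a length of 16408; they are not copied from the actual trace.

```python
import struct

# Content types defined for the TLS record layer.
TLS_CONTENT_TYPES = {
    20: "change_cipher_spec",
    21: "alert",
    22: "handshake",
    23: "application_data",
}


def parse_tls_record_header(header: bytes):
    """Decode a 5-byte TLS record header into (type, version, length)."""
    if len(header) != 5:
        raise ValueError("a TLS record header is exactly 5 bytes")
    content_type, major, minor, length = struct.unpack("!BBBH", header)
    name = TLS_CONTENT_TYPES.get(content_type, "unknown")
    return name, (major, minor), length


# Hand-made example header: application data, version bytes 3.3, length 0x4018.
example = bytes([0x17, 0x03, 0x03, 0x40, 0x18])
print(parse_tls_record_header(example))
```

Comparing the declared length with the byte count of the following receive call shows whether a record arrived whole or was cut short.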

<rekado>a few chunks like that are followed by writing to a newly allocated file

<rekado>the only difference I have found so far is that in the broken trace the last chunk is short.

<rekado>2706 bytes instead of 16408

<rekado>it is also interesting to me that it appears to be always the same file

<rekado>I’ll run this again to see if it always cuts off after 2706 bytes, and what would follow

<civodul>ah, locale?

<civodul>could it be that the daemon is running in a different locale?

<civodul>or that the client ("guix build") is instructing 'guix substitute' to switch to a different locale?

<civodul>and so, if there's a UTF-8 file name, things could go wrong

<civodul>(i'm thinking out loud because this is supposed to be addressed, but who knows)

<rekado>unfortunately, it’s not always the same file after all

<civodul>damnit

<rekado>and it’s not the same number of bytes either

<rekado>bummer

<civodul>there's still one that never fails though, right?

<rekado>right

<rekado>the plain substitute invocations don’t fail

<rekado>only those launched by the daemon

<civodul>what if you replace "/tmp/0ad" with "/gnu/store/0ad-tmp" (literally) in the command above?

<civodul>you'll need to run it as root, but that's ok

<rekado>ah, writing to NFS…

<civodul>yeah, just to be sure...

<rekado>arfff! It fails

<rekado>I can tell because it stalls

<rekado>at 0.9%

<rekado>(I’m writing to /gnu/tmp-0ad, as all of /gnu is on the NFS server)

<rekado>no wait

<rekado>it went past 0.9

<rekado>false alarm

<rekado>it’s terribly slow, though

<rekado><1MiB/s

<rekado>it completed with a reported 2.9MiB/s

<civodul>if it gets too slow, perhaps the connection could time out at some point?

<civodul>well, it's not *that* slow either

<rekado>yes, that’s my hunch too.

<rekado>it sits at 0.9% for a long time

<rekado>and at that time the reported average speed is in the KiBs

<rekado>for comparison, “guix download https://ci.guix.gnu.org/nar/lzip/l0jsrzgxdcghb87gly3zc5mvr9ylxi8g-0ad-data-0.0.23b-alpha” reports 340MiB/s

<rekado>I’ll try again to see if I can get it to fail when writing to /gnu

<civodul>but that's also writing to the store, no?

<rekado>yes!

<civodul>perhaps you could pass -t when stracing the daemon

<civodul>so we can see timings

<rekado>so I really don’t get why that other method is so much slower

<rekado>okay

<civodul>and also, check the nginx logs on berlin when it crashes

<civodul>is your daemon running locally and writing to NFS?

<rekado>yes, local daemon with /gnu and /gnu/var/guix both sitting on a remote NFS

<civodul>ah but wait, the throughput difference is just decompression + unpacking

<civodul>the "guix download" command creates a single file

<civodul>whereas "guix substitute" decompresses and creates many files

<rekado>right
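
[Editor's sketch] The contrast civodul draws, one large file versus many small ones, can be measured on a given mount with something like the following Python sketch. It is only a loose imitation: the per-file write, chmod, rename and lstat sequence is an assumption modeled on the calls rekado mentions seeing in the trace, not what "guix substitute" actually does, and the function names are made up.

```python
import os
import time


def write_single(directory: str, total_bytes: int) -> float:
    """Write one file of `total_bytes` bytes; return elapsed seconds."""
    start = time.perf_counter()
    with open(os.path.join(directory, "single.bin"), "wb") as f:
        f.write(b"\0" * total_bytes)
    return time.perf_counter() - start


def write_many(directory: str, count: int, size: int) -> float:
    """Write `count` files of `size` bytes each with extra per-file
    metadata calls; return elapsed seconds."""
    payload = b"\0" * size
    start = time.perf_counter()
    for i in range(count):
        tmp = os.path.join(directory, f"file-{i}.tmp")
        final = os.path.join(directory, f"file-{i}")
        with open(tmp, "wb") as f:
            f.write(payload)
        os.chmod(tmp, 0o444)   # make the file read-only
        os.rename(tmp, final)  # move it into place
        os.lstat(final)        # one more metadata round trip
    return time.perf_counter() - start
```

Running both functions against a scratch directory on the NFS mount and against a local /tmp, with the same total amount of data, gives a rough idea of how much of the slowdown comes from per-file metadata operations rather than from the network transfer.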

<rekado>hmm, “-t” shows me that *all* the action happens within the same second

<rekado>within that same second it writes a few different files

<rekado>so it’s not like the connection is idle for a long time

<rekado>but that’s really not what it looks like

<civodul>yes, but perhaps at some point openat(2) & co. become slower on the NFS?

<rekado>what the user sees is that the download starts, then it stays at 0.9% and the only thing that gets updated after a few seconds is the speed indicator

<civodul>you mean feedback slows down, but operations don't?

<rekado>yes

<rekado>there’s nothing obvious in the logs

<rekado>lots of operations in that last second

<civodul>ok

<rekado>lots of files written

<rekado>but on the command line things just look frozen.

<rekado>I should note, though, that even in the successful case feedback around 1% updates more slowly than afterward

<rekado>(which is why I thought earlier that the test write to /gnu/tmp-0ad was about to fail)

<rekado>I wonder if we can tweak timeouts somewhere

<rekado>I’ll inspect

<rekado>/proc/sys/net/ipv4/

<rekado>weird: after 5 seconds the UI doesn’t update any more; next update is at 00:35 seconds; then again at 1:22; then again at 1:40

<rekado>in between the second counter does not advance at all

<rekado>after 1:40 it’s 2:06

<rekado>all this time things are happening according to the log

<civodul>hmm

<civodul>BTW, did you try (1) on a different machine on the same network, and (2) on a machine on a different network?

<rekado>no, I haven’t tried different machines yet

<rekado>I’ve got two other independent deployments of Guix here, but they don’t use this shared NFS server.

2021-04-18.log