IRC channel logs

<rlb>jcowan: looks like mem_iconveh leaves us still out of spec for truncated input, i.e. utf16->string for #vu8(120) -> "".

<rlb>but rnrs says "\ufffd"

<rlb>I suppose I could try to patch that one case back up by detecting what the count *should* have been and adding a final missing replacement myself, i.e. add one if input % 4 != 0.

<rlb>Though that's still assuming undocumented and consistent iconveh behavior, so I'll probably just leave it alone for now...

<rlb>(assumes consistent behavior for an undocumented (afaik) case)
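
A minimal sketch of the result rlb quotes from rnrs, where truncated UTF-16 input decodes to a single replacement character. It uses Python's built-in decoder purely for comparison; it does not call mem_iconveh or Guile, so it only shows the expected output, not the Guile code path.

```python
# One stray byte (120 == ord("x")) cannot form a complete UTF-16 code unit.
truncated = bytes([120])

# With the "replace" error handler the incomplete unit becomes U+FFFD
# instead of being dropped silently.
decoded = truncated.decode("utf-16-le", errors="replace")
print(repr(decoded))
```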

<rlb>utf8 branch now has the utfN->string string->utfN fixes and additional tests

<apteryx>isn't this a valid SRFI 37 option declaration? https://paste.guixotic.coop/nar-herder-21067-21710.in.html passing --x-accel-redirect=no causes the option parser to error with "srfi/srfi-37.scm:110:6: In procedure args-fold: Extraneous argument after `--x-accel-redirect'"

<apteryx>but the documentation says "If OPTIONAL-ARG?, an argument will be taken if available."; maybe this only works for short options?

<apteryx>re earlier SRFI 37 question, never mind, it works correctly now

<old>hey trick question here

<old>I have a file timestamp using stat + stat:mtime + stat:mtimensec

<old>What type must I use to create a time object from this timestamp with srfi-19?

<old>My guess would be UTC, but I am not knowledgeable enough about file-system timestamps to be sure

<jcowan>Yes, filesystem timestamps are always kept in UTC. Almost all parts of the Unix tradition keep all non-UTC systems, whether TAI or LCT, at the edges only.

<old>great thanks!

<dsmith>Yep. UTC. TAI seems like it would be a better choice, so don't have to deal with leap seconds.

<old>In the context of a build-system that compares input/output timestamps, I'm not sure leap seconds would matter much

<old>it could probably in some rare cases trigger a spurious build

<dsmith>Oh it's worse.

<dsmith>I don't remember the actual numbers, but some filesystems have a very coarse granularity for recorded timestamps

<jcowan>Yes, technically Posix time accounts for leap seconds by doubling up on the broken-out labels.

<dsmith>Like up to 3 seconds or so?

<jcowan>Yes, and that can break 'make'.

<jcowan>Very few systems maintain nanosecond *accuracy*, especially since even TAI is only known retrospectively: it is a weighted average of a great many clocks worldwide.

<dsmith>When comparing the timestamps of a .go file and its corresponding .scm, Guile chooses the .go if the times are the same.

<jcowan>Huh. I would have made the opposite choice.

<jcowan>(compiling is safer than not compiling)

<dsmith>jcowan, I did some work implementing PTP on embedded Linux. We had devices that could synchronize to 30ns.

<dsmith>jcowan, That's the way I was leaning too.

<jcowan>OTOH, if almost all compiles take less than a second, then perhaps conservatism is the wrong choice because everything will get recompiled every time.

<dsmith>The argument was that after a build, the files might be copied somewhere with same timestamps. And the desire to not re-compile everything after that

<dsmith>But it still feels wrong to me...
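
A minimal sketch of the tie-breaking question discussed above: what to do when the compiled file and its source carry the same timestamp. The function and parameter names are hypothetical and this is not Guile's actual code; the default only mirrors the behaviour dsmith describes, where the compiled file wins on a tie.

```python
def use_compiled(source_mtime_ns: int, compiled_mtime_ns: int,
                 prefer_compiled_on_tie: bool = True) -> bool:
    """Return True if the compiled file should be used without recompiling."""
    if compiled_mtime_ns > source_mtime_ns:
        return True                      # compiled file is strictly newer
    if compiled_mtime_ns == source_mtime_ns:
        return prefer_compiled_on_tie    # the policy choice under discussion
    return False                         # source is newer: recompile


# Copying a build tree with preserved timestamps produces ties like this one.
print(use_compiled(100, 100))                                # compiled wins
print(use_compiled(100, 100, prefer_compiled_on_tie=False))  # conservative
```

Setting prefer_compiled_on_tie to False gives the conservative choice jcowan first suggested, at the cost of recompiling whenever a compile finishes within the timestamp granularity.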

<old>So, should I stick with UTC or TAI?

<dsmith>UTC
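
A minimal sketch of treating a file's mtime as a UTC instant, shown in Python rather than Guile. The helper names are hypothetical. The seconds and nanoseconds pair it returns corresponds to what stat:mtime and stat:mtimensec provide, which in SRFI 19 terms would go into a time-utc object.

```python
import os
import tempfile
from datetime import datetime, timezone


def mtime_parts(path):
    """Return (seconds, nanoseconds) since the Unix epoch for path's mtime."""
    return divmod(os.stat(path).st_mtime_ns, 1_000_000_000)


def mtime_utc(path):
    """Return the mtime as a timezone-aware UTC datetime (microsecond precision)."""
    seconds, nanos = mtime_parts(path)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=nanos // 1000)


with tempfile.NamedTemporaryFile() as f:
    # Set a known timestamp; the filesystem may round the nanosecond part.
    stamp = 1_700_000_000 * 1_000_000_000 + 123_456_789
    os.utime(f.name, ns=(stamp, stamp))
    print(mtime_parts(f.name), mtime_utc(f.name).isoformat())
```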

<old>Also for BLUE, there would be a user preference for hashing file content instead of just timestamp comparison

<old>ok

<old>but I want to support both modes because hashing could be heavyweight in some cases. Like in a CI where the environment is controlled (e.g. fresh VM), I think it might make more sense to do timestamp instead of hashing

<jcowan>Hashing is an interesting idea, but of course it's O(n) instead of O(1)

<jcowan>Fortunately source code files are not measured in terabytes.

<old>true, but as I measured the other day

<old>on my machine, I can throughput ~260 Kib/s of sha256sum with Guile

<old>I made some changes to rnrs arithmetic fixnums and now I am at 1 Mib/s

<old>good yield, but nothing compared to sha256sum(1) of coreutils at 350 Mib/s
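
A minimal sketch of the two change-detection modes discussed above, timestamp comparison versus content hashing. The function name and the mode strings are hypothetical and say nothing about how BLUE implements its preference.

```python
import hashlib
import os


def fingerprint(path, mode="timestamp"):
    """Return a value that changes when the file has (probably) changed."""
    if mode == "timestamp":
        # O(1): reads metadata only.
        return os.stat(path).st_mtime_ns
    if mode == "hash":
        # O(n): reads the whole file, in chunks to bound memory use.
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                digest.update(chunk)
        return digest.hexdigest()
    raise ValueError(f"unknown mode: {mode!r}")
```

A build tool would store the fingerprint after each build and rebuild when the stored value differs from the current one.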

<jcowan>I wonder if there are incremental hash algorithms with the property that if A is a prefix of B, then hash(A) is related to hash(B) such that you don't have to start over from scratch.

<old>that would defeat the cryptographic nature of such algorithm I think

<old>I'm not an expert on that field

<jcowan>Yes, but you don't need cryptographic hashes for this purpose.
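
A minimal sketch of one limited form of the prefix property jcowan asks about. Streaming hashes such as SHA-256 can be extended from a saved internal state, so hashing B does not have to re-read the prefix A. The catch is that the state object must be kept, since the final digest of A alone is not enough. This sketch holds the state in memory only and does not show persisting it between builds.

```python
import hashlib

prefix = b"(define (f x) x)\n"
suffix = b"(define (g y) y)\n"

# Hash the prefix once and keep the state object, not just its digest.
state_after_prefix = hashlib.sha256(prefix)

# Later, extend a copy of that state with only the new bytes.
extended = state_after_prefix.copy()
extended.update(suffix)

# Same result as hashing the whole input from scratch.
print(extended.hexdigest() == hashlib.sha256(prefix + suffix).hexdigest())
```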

<old>it depends

<old>for local build sure, but what if we want at some point to have a distributed build-system

<old>kind of what Guix does I guess

<jcowan>For that matter, forging timestamps is ridiculously simple if you want to pretend that a compiled file is newer than a source file when it isn't.

<old>it's more toward sharing build artifacts

<old>but I guess this is more of a trusting issue anyway

<jcowan>Such a system would not be just distributed but byzantine, and I wouldn't compile my code on untrusted systems: how do you know what it's injecting?

<old>ACTION wonders how Bazel does it

<rlb>I believe I recall some discussion (maybe lwn) of having a file "generational" counter/value --- might have been partially in the context of nfs, but don't recall the details.

<rlb> https://lwn.net/Articles/975863/

<rlb>Think that may be what I remembered.

<rlb>This also seemed interesting in the build domain: https://redo.readthedocs.io/en/latest/

<old>there's n2 also that's interesting

<old>ninja successor

<ieure>People need to stop making new build systems.

<jcowan>Why? That's like saying people need to stop making new programming languages, or DEs, or editors.

<jcowan>"Let a hundred flowers bloom; let a thousand schools of thought contend."

<jcowan>Of course Mao didn't mean it, but we do.

<rlb>mtime vs mmap behavior also "varies" --- e.g. see "Does writing to a file via mmap() update the mtime?" here: https://apenwarr.ca/log/20181113

<rlb>(It also talks about how that flavor of redo mitigates the timestamp issues lower down.)

<rlb>old: that might or might not be relevant to your situation (i.e. the additional data/logic it uses to decide --- in the next section there, and looks like it talks about bazel a bit too).

<old>rlb: right virtual mapping of file is a mess with timestamp

<old>oh ya that blog post, I have it pinned down somewhere

<rlb>...redo's approach overall seemed interesting, but I've not looked at it in detail.

<old>I guess the interesting bits are here: https://grosskurth.ca/papers/mmath-thesis.pdf

<rlb>There's also some summary info/overview in that version of redo's docs somewhere, I think.

<rlb>Part of it is just the way it learns the finer grain deps as they're revealed, and just explicitly tracks the state (in sqlite last time I looked, maybe?). Don't recall --- been a good while since I poked at it.

<jcowan>rlb: did we talk about noncharacter-based error recovery while encoding and decoding? I forget.

<rlb>I don't recall that specifically, but perhaps.

<ieure>jcowan, Too dang many of them, too many features (downloading deps is IMO *not* a build tool problem), too much fragmentation means you're often learning a new build system to do anything on a new project, even if you already know the language. Build system affects how *everyone* uses the software, DEs/editors do not.

<jcowan>I find that I change build files far less often than code files.

<jcowan>If I'm working on a greenfield project, then I need to make a choice and create a build file

<jcowan>rlb: It's about dealing with dirty UTF-n data that comes in from pathnames, environment variables, etc. in a way that doesn't penalize people for the rare cases when it actually is dirty.

<jcowan>The three standard error recovery modes all have problems: they discard information. The alternatives are to allow dirty data to infect your strings (contrary to R[67]RS)

<rlb>Are you just talking about having noncharacters as another "error handler" --- if so, then I'd assumed that'd be part of the approach *if* we decide to go some noncharacters-involving route.

<jcowan>or else to have a twofold system, where you might get a bytevector instead of a string and have to handle it.

<jcowan>rlb: Yes

<rlb>That's also what python does(ish).

<rlb>e.g. errors="surrogateescape", etc.

<jcowan>Except that works because Python strings are allowed to contain unpaired surrogates, unlike Scheme strings.
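
A minimal sketch of the Python behaviour mentioned above. With errors="surrogateescape", an undecodable byte becomes an unpaired surrogate in the string, and encoding with the same handler restores the original bytes. The file name is made up for illustration.

```python
raw = b"caf\xe9.txt"   # Latin-1 bytes, not valid UTF-8

name = raw.decode("utf-8", errors="surrogateescape")
# The undecodable byte 0xE9 is carried through as the lone surrogate U+DCE9.
print(ascii(name))

# Encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", errors="surrogateescape")
print(round_tripped == raw)
```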

<jcowan>Anyway, there's a writeup of the noncharacter approach at https://codeberg.org/scheme/r7rs/wiki/Noncharacter-error-handling.md that I thought would interest you.

<rlb>I didn't mean their specific approach, we'd use noncharacters, of course.

<jcowan>Good.

<rlb>Just the general approach of making it available (at least) as one of the recovery strategies

<rlb>If we do go this route, I'm now leaning toward the idea that maybe we distinguish "kinds"/sources of data and don't have a blanket policy, e.g. perhaps apply noncharacters to say paths, env vars and argv by default, but not port content.

<jcowan>The two basic ideas are to encode a byte XY as U+FDDX U+FDDY and to quote incoming noncharacters with U+FFFE.
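
A minimal sketch of the two ideas as stated in that one-line summary, decode direction only. All names are hypothetical. Which noncharacters get quoted (here U+FDD0..U+FDDF and U+FFFE itself) is an assumption of this sketch, and the linked writeup may differ in detail.

```python
NONCHAR_BASE = 0xFDD0   # U+FDD0..U+FDDF: one noncharacter per nibble value
QUOTE = "\ufffe"        # prefix for noncharacters genuinely present in the input


def escape_byte(b: int) -> str:
    """Encode an undecodable byte XY as the pair U+FDDX U+FDDY."""
    return chr(NONCHAR_BASE + (b >> 4)) + chr(NONCHAR_BASE + (b & 0xF))


def quote_noncharacters(text: str) -> str:
    """Prefix noncharacters that arrived as valid input with U+FFFE."""
    def needs_quote(c):
        return c == QUOTE or NONCHAR_BASE <= ord(c) <= NONCHAR_BASE + 0xF
    return "".join(QUOTE + c if needs_quote(c) else c for c in text)


def decode_with_noncharacters(data: bytes) -> str:
    """Decode UTF-8, escaping bad bytes instead of discarding information."""
    out, pos = [], 0
    while pos < len(data):
        try:
            out.append(quote_noncharacters(data[pos:].decode("utf-8")))
            break
        except UnicodeDecodeError as e:
            # e.start and e.end are relative to the slice data[pos:].
            good = data[pos:pos + e.start].decode("utf-8")
            out.append(quote_noncharacters(good))
            out.extend(escape_byte(b) for b in data[pos + e.start:pos + e.end])
            pos += e.end
    return "".join(out)


print(ascii(decode_with_noncharacters(b"a\xc3")))   # truncated sequence at the end
```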

<rlb>That might also be a practical approach wrt the likelihood of any "data smuggling" risks we've discussed --- still covering the important/common cases.

<rlb>Right, I know the strategy fairly well --- have part of it already implemented in a branch (the enc/dec part) here.

<jcowan>Good.

<jcowan>Corruption in UTF-8 files is not uncommon, though.

<rlb>Though I haven't implemented the newer \s bit.

<jcowan>ACTION nods

<jcowan>I'm not sure how useful that is.

<rlb>Sure, though I'm thinking that for file content, we perhaps default to "error", and you need to ask for anything else.

<rlb>file/network/etc. --- i.e. port content.

<rlb>But for paths, etc. you get noncharacters by default, so that they "round-trip" without any special effort or need to know about all this mess.

<jcowan>When Plan 9 was converted to UTF-8 throughout (in one day!) they originally went with error mode, but found that it required too much ceremony.

<jcowan>"Originally the conversion routines, described

<jcowan>below, returned errors when given invalid UTF, but we found ourselves repeatedly checking for errors and

<jcowan>ignoring them. We therefore decided to convert a bad sequence to a valid rune and continue processing."

<jcowan>They used U+0080 instead of U+FFFD, on the grounds that distinguishing the uncodable from the undecodable made sense.
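
A minimal sketch of the "convert a bad sequence to a valid rune and continue" policy, written as a Python codec error handler rather than Plan 9's C routines. The handler name is hypothetical. It substitutes U+0080, as in the description above, for each undecodable run.

```python
import codecs


def replace_with_u0080(err):
    """Substitute U+0080 for an undecodable run and resume after it."""
    if isinstance(err, UnicodeDecodeError):
        return ("\u0080", err.end)
    raise err


codecs.register_error("u0080-replace", replace_with_u0080)

text = b"ok\xffok".decode("utf-8", errors="u0080-replace")
print(ascii(text))
```

Callers never see an exception, which is the "less ceremony" trade-off: decoding always succeeds, but the original bad bytes are not recoverable from the result.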

<rlb>I'm currently thinking we should consider switching ports to error by default in one of our forthcoming compatibility-breaking releases (perhaps with an easy way to opt-out --- i.e. change the default).

<jcowan>Does Guile use character buffers as well as byte buffers when reading from textual ports?

<rlb>I know our convention in the C code is to have a space before function call arguments; is that also the case when we're selecting a field in a returned struct? e.g.

<rlb> x = y + some_thing (z).some_field;

<rlb>(I'll assume so for now.)

<rlb>(Just looks a bit odd.)

<old>always found that C style .. weird

<rlb>Not what I'm used to either.

<identity>like writing 2 + 2 instead of 2+2

<rlb>But I've been writing so much of it, I at least don't have to correct myself all the time anymore :)

<rlb>(in libguile)

<ekaitz>rlb: https://www.gnu.org/prep/standards/html_node/Writing-C.html

<ekaitz>maybe it doesn't explain that specific case

<jcowan>It doesn't *explain* at all. It just dictates, which means it is no more than a prejudice, and a weird one at that.

<jcowan>Nobody but GNU writes function calls in any context as foo (x).

<jcowan>The arrogance of "We find it easier to read a program when it has spaces before the open-parentheses" is stunning. Who is "we"?

<identity>«you people use spaces in your programs?»

<identity>i tried one of my small programs, the ratio of other characters to spaces is close to 32… i would guess that is just me, though

<probie>jcowan: Different language, but the most commonly used style in Ada also adds an unneeded space between function/procedure calls and their arguments

<dsmith>Ada also uses () for arrays. Weird.

<Arsen>20:29:23 <jcowan> The arrogance of "We find it easier to read a program when it has spaces before the open-parentheses" is stunning. Who is "we"?

<Arsen>it's a royal "we". fairly usual - any document that sets any standard for any project uses the word "we"

<Arsen>> In certain structures which are visible to userspace, we cannot require C99 types and cannot use the u32 form above. Thus, we use __u32 and similar types in all structures which are shared with userspace.

<Arsen>another example, from linux

<probie>If we ignore the actual implementation on metal, what is an array except a function from index to value?

<Arsen>probie: their set of indices is also contiguous ;)

<rlb>In clojure vectors *are* also functions, and so are maps, and sets...

<rlb>user=> ([1 2 3] 1)

<rlb>2

<Arsen>20:25:51 <jcowan> It doesn't *explain* at all. It just dictates, which means it is no more than a prejudice, and a weird one at that.

<Arsen>(foo bar baz) => foo (bar, baz) <- there's the explanation

<rlb>commonly used for filtering, etc. (filter some-set some-collection)

<Arsen>in fact that also explains the positioning of curlies under flow control statements such as if (consider progn/begin)

<rlb>(filter #{42} [1 2 3 ...])

<Arsen>re the question on 'foo ().bar'; yes, the space is correct. example from gcc: get_global_range_query ()->range_of_expr (r, op0, stmt);

<Arsen>here's one from emacs: print_string (BVAR (XBUFFER (XWINDOW (obj)->contents), name),

<dsmith>ACTION shudders

<Arsen>... and from libguile ;) #define SCM_HASHTABLE_N_ITEMS(x) (SCM_HASHTABLE (x)->n_items)

<mwette>I also hate the space.

<ekaitz>jcowan: Do you know *everybody*?

<identity>probably

<ekaitz>i found it surprising at the beginning, and now I started to like it

<ekaitz>maybe only GNU does this, but they have been doing it for 40 years so I respect that

<ekaitz>it's not like I'm going to be a guest in your home and don't listen to your habits

<ekaitz>if you like to take your shoes off, i do

<ekaitz>i don't think having a preference is arrogance

<ekaitz>(the half indentation is weirder, and nobody seems to complain about it)

<dsmith>*I* do!

<dsmith>(Not really. I've got to the "whatever" point a long time ago)

<ekaitz>:)

<dsmith>I learned C from K&R, and to me that's what C code should look like.

<ekaitz>ACTION is so dumb he doesn't have very strong opinions about anything

<ekaitz>ACTION thinks he is getting dumber

<dsmith>It's really what you are used to. I really hated how "noisy" Rust looked after previously learning Go.

<dsmith>But you just get used to it, and then it's ok.

<ekaitz>dsmith: wait! but that doesn't let me complain and be judgemental about other people! Unacceptable!

<ekaitz>also, it requires some effort from my side! doubly unacceptable!

<ekaitz>it must be somebody else's problem! they are *so arrogant*!

<dsmith>The default emacs C-mode does gnu by default. As does the indent prog

<dsmith>A nice thing about Rust and Go is there is an enforced layout. And so no pointless trivial arguments.

<rlb>...I kinda want a "remember" variant like scm_remember_and_return_1 (remembered, return) for cases like scm_remember_and_return_1 (stringbuf, u8[i]) where u8 is the internal content of stringbuf.

<rlb>Avoids a bunch of otherwise unnecessary 4 line blocks.

<rlb>(no idea what it should be named, if it's even plausible)

<rlb>e.g. you can just "if (foo) scm_remember_and_return_1 (buf, u8[i]);"

<rlb>though of course that particular case doesn't work --- it'd only work (easily) for SCM return values.

<rlb>unless "macro"

<rlb>OK, think all the likely string_refs are gone from libguile in utf8 --- i.e. the simplicity of the remaining calls may be worth the minimal extra cost, but can always revisit later. Well, there's still the use in array-handle.c that I haven't investigated, but offhand, I'm guessing it's not something that could be easily changed.

<rlb>I'll push an update later.

2026-04-17.log